2025-05-07T20:22:35.2633365Z Current runner version: '2.323.0'
2025-05-07T20:22:35.2643219Z Runner name: 'i-06f3d8044a6f79407'
2025-05-07T20:22:35.2644634Z Machine name: 'ip-10-0-69-200'
2025-05-07T20:22:35.2648830Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:35.2651872Z Contents: read
2025-05-07T20:22:35.2652601Z Metadata: read
2025-05-07T20:22:35.2653339Z Packages: read
2025-05-07T20:22:35.2654066Z ##[endgroup]
2025-05-07T20:22:35.2657355Z Secret source: None
2025-05-07T20:22:35.2658402Z Prepare workflow directory
2025-05-07T20:22:35.3205801Z Prepare all required actions
2025-05-07T20:22:35.3241711Z Getting action download info
2025-05-07T20:22:35.5052362Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.7991376Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:36.1737532Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.7857766Z Getting action download info
2025-05-07T20:22:37.9220447Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:38.1484126Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.10, 12.6.3, 12.6.3, clang)
2025-05-07T20:22:38.2033583Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:38.2150737Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:38.2163085Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:38.2164309Z ##[endgroup]
2025-05-07T20:22:39.3996613Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:39.3997102Z Instance Type: g5.4xlarge
2025-05-07T20:22:39.3997443Z AMI Name: unknown
2025-05-07T20:22:39.4037450Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.7435026Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.7435341Z with:
2025-05-07T20:22:44.7435567Z   submodules: true
2025-05-07T20:22:44.7435806Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.7436207Z   token: ***
2025-05-07T20:22:44.7436412Z   ssh-strict: true
2025-05-07T20:22:44.7436625Z   ssh-user: git
2025-05-07T20:22:44.7436857Z   persist-credentials: true
2025-05-07T20:22:44.7437108Z   clean: true
2025-05-07T20:22:44.7437343Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.7437623Z   fetch-depth: 1
2025-05-07T20:22:44.7437882Z   fetch-tags: false
2025-05-07T20:22:44.7438103Z   show-progress: true
2025-05-07T20:22:44.7438332Z   lfs: false
2025-05-07T20:22:44.7438543Z   set-safe-directory: true
2025-05-07T20:22:44.7438794Z env:
2025-05-07T20:22:44.7439009Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.7439326Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.7439591Z   BUILD_TARGET: genai
2025-05-07T20:22:44.7439814Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.7440078Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.7440337Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.7440580Z ##[endgroup]
2025-05-07T20:22:44.8588730Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.8589896Z ##[group]Getting Git version info
2025-05-07T20:22:44.8590338Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8590957Z [command]/usr/bin/git version
2025-05-07T20:22:44.8591229Z git version 2.47.1
2025-05-07T20:22:44.8612564Z ##[endgroup]
2025-05-07T20:22:44.8625319Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/9f623c0f-50d9-4606-af33-1a85c87373d0' before making global git config changes
2025-05-07T20:22:44.8626232Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.8639419Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8677729Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.8680809Z ##[group]Initializing the repository
2025-05-07T20:22:44.8685416Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.8726733Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.8727365Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.8727923Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.8728299Z hint:
2025-05-07T20:22:44.8728598Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.8728928Z hint:
2025-05-07T20:22:44.8729258Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.8729799Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.8730205Z hint:
2025-05-07T20:22:44.8730434Z hint:   git branch -m <name>
2025-05-07T20:22:44.8730934Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.8740162Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.8775469Z ##[endgroup]
2025-05-07T20:22:44.8775967Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.8779925Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.8813573Z ##[endgroup]
2025-05-07T20:22:44.8814375Z ##[group]Setting up auth
2025-05-07T20:22:44.8819340Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.8850907Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.9225036Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.9257624Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.9600223Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.9648603Z ##[endgroup]
2025-05-07T20:22:44.9649010Z ##[group]Fetching the repository
2025-05-07T20:22:44.9657292Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.3966749Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.3967558Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3990950Z ##[endgroup]
2025-05-07T20:22:45.3994257Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3994698Z ##[endgroup]
2025-05-07T20:22:45.3998974Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:22:45.4047043Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
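For reference, the checkout above can be reproduced outside of CI. The commands below are a minimal sketch assembled from the [command] lines in this log (the masked basic-auth header is omitted, so a public, unauthenticated clone is assumed):

  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  # Fetch only the PR merge commit (fetch-depth: 1), exactly as actions/checkout does
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
      origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge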
2025-05-07T20:22:45.4090367Z ##[group]Checking out the ref
2025-05-07T20:22:45.4094584Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.5187488Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.5187915Z
2025-05-07T20:22:45.5188239Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.5188991Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.5189718Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.5190160Z
2025-05-07T20:22:45.5190487Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.5191149Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.5191523Z
2025-05-07T20:22:45.5191688Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.5191957Z
2025-05-07T20:22:45.5192137Z Or undo this operation with:
2025-05-07T20:22:45.5192395Z
2025-05-07T20:22:45.5192521Z   git switch -
2025-05-07T20:22:45.5193054Z
2025-05-07T20:22:45.5193380Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.5193846Z
2025-05-07T20:22:45.5194372Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.5202236Z ##[endgroup]
2025-05-07T20:22:45.5202648Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.5208383Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.5259887Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.5292611Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.5324669Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.5352885Z ##[endgroup]
2025-05-07T20:22:45.5353279Z ##[group]Fetching submodules
2025-05-07T20:22:45.5356916Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.5702380Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.6034899Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.6037678Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.6041077Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.6044678Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.6048831Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.6052714Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.6056825Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.6088024Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:46.1641568Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.6085491Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:47.0510137Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:48.1706791Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.5045348Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.7837553Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.8658035Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.8658699Z  * branch            e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.9126923Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.1939768Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.1940270Z  * branch            4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.4731636Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1882961Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1884165Z  * branch            6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2988420Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.4650700Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.4651161Z  * branch            3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.1616943Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.9598506Z From https://github.com/google/googletest
2025-05-07T20:22:54.9598960Z  * branch            f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.9999188Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.6035397Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.6035885Z  * branch            420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.6120516Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.3444387Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.3444992Z  * branch            9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.4580367Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.4598651Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.4933642Z Entering 'external/asmjit'
2025-05-07T20:22:56.4965482Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.4997047Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5028362Z Entering 'external/cutlass'
2025-05-07T20:22:56.5060683Z Entering 'external/googletest'
2025-05-07T20:22:56.5091684Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5125205Z Entering 'external/json'
2025-05-07T20:22:56.5169657Z ##[endgroup]
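The --depth=1 submodule update above is why each clone is followed by a '* branch <sha> -> FETCH_HEAD' line: every submodule is pinned by fetching its recorded commit directly instead of cloning full history. A minimal sketch of the equivalent per-submodule operation, using the asmjit pin from this log:

  cd external/asmjit
  # Fetch just the pinned commit; this is what prints "* branch <sha> -> FETCH_HEAD"
  git fetch --depth=1 origin e5d7c0bd5d9aec44d68830187138149e6a8c4e32
  git checkout --detach e5d7c0bd5d9aec44d68830187138149e6a8c4e32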
2025-05-07T20:22:56.5170067Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.5176568Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.5506477Z Entering 'external/asmjit'
2025-05-07T20:22:56.5572569Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5644558Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5711835Z Entering 'external/cutlass'
2025-05-07T20:22:56.5786872Z Entering 'external/googletest'
2025-05-07T20:22:56.5852612Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5918727Z Entering 'external/json'
2025-05-07T20:22:56.6003561Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.6336826Z Entering 'external/asmjit'
2025-05-07T20:22:56.6400731Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.6403006Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6463530Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.6466490Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6526661Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.6529626Z Entering 'external/cutlass'
2025-05-07T20:22:56.6590423Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.6593376Z Entering 'external/googletest'
2025-05-07T20:22:56.6653891Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.6657256Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6718203Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.6721107Z Entering 'external/json'
2025-05-07T20:22:56.6784015Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.6890434Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.7221487Z Entering 'external/asmjit'
2025-05-07T20:22:56.7253983Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7285308Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7317315Z Entering 'external/cutlass'
2025-05-07T20:22:56.7349185Z Entering 'external/googletest'
2025-05-07T20:22:56.7381787Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7414142Z Entering 'external/json'
2025-05-07T20:22:56.7463841Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.7791454Z Entering 'external/asmjit'
2025-05-07T20:22:56.7825965Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7859032Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7890877Z Entering 'external/cutlass'
2025-05-07T20:22:56.7944295Z Entering 'external/googletest'
2025-05-07T20:22:56.7958745Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7990022Z Entering 'external/json'
2025-05-07T20:22:56.8033900Z ##[endgroup]
2025-05-07T20:22:56.8075900Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.8102895Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
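Credential persistence above is plain git config: the masked Authorization header is written into each submodule's local config, and two insteadOf rules rewrite SSH-style URLs to HTTPS so the header applies to every fetch. A sketch of the effective per-submodule configuration (values copied from the log; the token stays masked):

  git config --local http.https://github.com/.extraheader 'AUTHORIZATION: basic ***'
  git config --local --add url.https://github.com/.insteadOf 'git@github.com:'
  git config --local --add url.https://github.com/.insteadOf 'org-21003710@github.com:'
  # Net effect: git@github.com:owner/repo is fetched as https://github.com/owner/repo,
  # and every HTTPS request to github.com carries the injected header.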
2025-05-07T20:22:56.8277994Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.8278322Z with:
2025-05-07T20:22:56.8278569Z   name: fbgemm_genai_x86_clang_py3.10_cu12.6.3.whl
2025-05-07T20:22:56.8278890Z   merge-multiple: false
2025-05-07T20:22:56.8279153Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.8279415Z   run-id: 14891846252
2025-05-07T20:22:56.8279625Z env:
2025-05-07T20:22:56.8279850Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.8280151Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.8280429Z   BUILD_TARGET: genai
2025-05-07T20:22:56.8280681Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.8280925Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.8281180Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.8281424Z ##[endgroup]
2025-05-07T20:22:57.0613770Z Downloading single artifact
2025-05-07T20:22:57.1601487Z Preparing to download the following artifacts:
2025-05-07T20:22:57.1602396Z - fbgemm_genai_x86_clang_py3.10_cu12.6.3.whl (ID: 3081363083, Size: 12540944, Expected Digest: sha256:afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e)
2025-05-07T20:22:57.2129883Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0a2daaca-7a55-5fcf-bcc5-f66fdbd32d30/artifacts/648fc1a1b73d5d5cd1d464169b896b3a80c98aae0ebb5ca5326862fe4d644842.zip
2025-05-07T20:22:57.2131368Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.3075918Z (node:56910) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.3076889Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.4982128Z SHA256 digest of downloaded artifact is afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e
2025-05-07T20:22:57.4982721Z Artifact download completed successfully.
2025-05-07T20:22:57.4983059Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.4988613Z Download artifact has finished successfully
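download-artifact verified the downloaded zip against the expected SHA256 digest before reporting success. The same check can be done by hand; a minimal sketch (the local archive name is hypothetical):

  echo 'afbb98e930da7c62e149bc1ea88813f21873c24e8bb8269009e6340258c9d98e  artifact.zip' | sha256sum --check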
2025-05-07T20:22:57.5248679Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.5249067Z with:
2025-05-07T20:22:57.5249284Z   driver-version: 570.133.07
2025-05-07T20:22:57.5249532Z env:
2025-05-07T20:22:57.5249750Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5250055Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5250312Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5250545Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5250796Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5251064Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5251306Z ##[endgroup]
2025-05-07T20:22:57.5351162Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.5351547Z with:
2025-05-07T20:22:57.5351925Z   timeout_minutes: 10
2025-05-07T20:22:57.5352157Z   max_attempts: 3
2025-05-07T20:22:57.5376116Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not print over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPUs. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:22:57.5399903Z   retry_wait_seconds: 10
2025-05-07T20:22:57.5400165Z   polling_interval_seconds: 1
2025-05-07T20:22:57.5400426Z   warning_on_retry: true
2025-05-07T20:22:57.5400673Z   continue_on_error: false
2025-05-07T20:22:57.5400914Z env:
2025-05-07T20:22:57.5401152Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5401486Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5401731Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5401954Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5402195Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5402454Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5402692Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:22:57.5402935Z ##[endgroup]
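The core of the script above is its driver health check: nvidia-smi exit statuses 0 and 14 are treated as healthy (per the linked gpu-operator issue), and the installed driver version must match DRIVER_VERSION exactly. A condensed sketch of that logic, with the same checks in fewer branches:

  DRIVER_VERSION=570.133.07
  if command -v nvidia-smi >/dev/null; then
    INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
    STATUS=$?
    # 0 and 14 are the only allowed nvidia-smi statuses; querying a concrete field
    # (driver_version, gpu_name) also catches the ERR! case where plain nvidia-smi
    # still exits 0 on a crashed driver.
    if { [ "$STATUS" -eq 0 ] || [ "$STATUS" -eq 14 ]; } && [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
      echo "Driver $INSTALLED already installed; skipping reinstall"
    fi
  fi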
2025-05-07T20:22:57.6207303Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.6209261Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.6209683Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:58.2346419Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:58.2347157Z No packages marked for removal.
2025-05-07T20:22:58.2411526Z Dependencies resolved.
2025-05-07T20:22:58.2422195Z Nothing to do.
2025-05-07T20:22:58.2423149Z Complete!
2025-05-07T20:22:58.2744408Z + install_nvidia_driver_common
2025-05-07T20:22:58.2750599Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:58.2750951Z + lspci
2025-05-07T20:22:58.2751915Z Before installing NVIDIA driver
2025-05-07T20:22:58.2936027Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.2936874Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.2937467Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.2938359Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.2939137Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.2939723Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.2940220Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.2940710Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.2941159Z + lsmod
2025-05-07T20:22:58.2981536Z Module                  Size  Used by
2025-05-07T20:22:58.2982182Z xt_conntrack           16384  1
2025-05-07T20:22:58.2982705Z nft_chain_nat          16384  3
2025-05-07T20:22:58.2983215Z xt_MASQUERADE          20480  1
2025-05-07T20:22:58.2983828Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.2984497Z nf_conntrack_netlink   57344  0
2025-05-07T20:22:58.2985290Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.2986168Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:22:58.2986798Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:22:58.2987389Z xfrm_user              57344  1
2025-05-07T20:22:58.2987908Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:22:58.2988487Z xt_addrtype            16384  2
2025-05-07T20:22:58.2989007Z nft_compat             20480  4
2025-05-07T20:22:58.2989603Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.2990438Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.2991125Z br_netfilter           36864  0
2025-05-07T20:22:58.2991448Z bridge                323584  1 br_netfilter
2025-05-07T20:22:58.2991756Z stp                    16384  1 bridge
2025-05-07T20:22:58.2992052Z llc                    16384  2 bridge,stp
2025-05-07T20:22:58.2992343Z overlay               167936  0
2025-05-07T20:22:58.2992588Z tls                   135168  0
2025-05-07T20:22:58.2992849Z nls_ascii              16384  1
2025-05-07T20:22:58.2993104Z nls_cp437              20480  1
2025-05-07T20:22:58.2993346Z vfat                   24576  1
2025-05-07T20:22:58.2993604Z fat                    86016  1 vfat
2025-05-07T20:22:58.2993878Z sunrpc                696320  1
2025-05-07T20:22:58.2994125Z i8042                  45056  0
2025-05-07T20:22:58.2994381Z serio                  28672  3 i8042
2025-05-07T20:22:58.2994657Z ena                   180224  0
2025-05-07T20:22:58.2994910Z ghash_clmulni_intel    16384  0
2025-05-07T20:22:58.2995187Z button                 24576  0
2025-05-07T20:22:58.2995446Z sch_fq_codel           20480  17
2025-05-07T20:22:58.2995703Z dm_mod                188416  0
2025-05-07T20:22:58.2995956Z fuse                  163840  1
2025-05-07T20:22:58.2996213Z loop                   36864  0
2025-05-07T20:22:58.2996478Z dax                    45056  1 dm_mod
2025-05-07T20:22:58.2996752Z configfs               57344  1
2025-05-07T20:22:58.2997018Z dmi_sysfs              20480  0
2025-05-07T20:22:58.2997278Z crc32_pclmul           16384  0
2025-05-07T20:22:58.2997530Z crc32c_intel           24576  0
2025-05-07T20:22:58.2997793Z efivarfs               24576  1
2025-05-07T20:22:58.2998048Z + modinfo nvidia
2025-05-07T20:22:58.3000056Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.3000569Z import_ns:      DMA_BUF
2025-05-07T20:22:58.3000827Z alias:          char-major-195-*
2025-05-07T20:22:58.3001155Z version:        570.133.07
2025-05-07T20:22:58.3001404Z supported:      external
2025-05-07T20:22:58.3001666Z license:        Dual MIT/GPL
2025-05-07T20:22:58.3001959Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.3002415Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.3002923Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:22:58.3003256Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.3003602Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.3003935Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.3004253Z depends:        i2c-core,drm
2025-05-07T20:22:58.3004512Z retpoline:      Y
2025-05-07T20:22:58.3004740Z name:           nvidia
2025-05-07T20:22:58.3005098Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.3005595Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.3006084Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.3006866Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.3007362Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:22:58.3007838Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.3008351Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:22:58.3008770Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:22:58.3009087Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:22:58.3009463Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.3010032Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.3010565Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.3011071Z parm:           NVreg_EnableMSI:int
2025-05-07T20:22:58.3011485Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.3011866Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.3012270Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.3012652Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.3013085Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3013501Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.3013936Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3014349Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.3014694Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.3015074Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.3015449Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.3015798Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.3016130Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.3016463Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.3016795Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.3017116Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:22:58.3017463Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.3017847Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.3018281Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:22:58.3018631Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.3018983Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.3019328Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:22:58.3019679Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.3020012Z parm:           NVreg_RmMsg:charp
2025-05-07T20:22:58.3020311Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.3020645Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.3020973Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.3021296Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.3021638Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.3022004Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.3022358Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:22:58.3022700Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.3023061Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.3023404Z parm:           rm_firmware_active:charp
2025-05-07T20:22:58.3023851Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:58.3024107Z ++ command -v nvidia-smi
2025-05-07T20:22:58.3024370Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:58.3024634Z + set +e
2025-05-07T20:22:58.3024951Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:00.1163815Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:00.1164194Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:00.1164435Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:00.1164667Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:00.1164945Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:00.1165381Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:00.1165865Z + set -e
2025-05-07T20:23:00.1166420Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:00.1166824Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:00.1167289Z + post_install_nvidia_driver_common
2025-05-07T20:23:00.1170162Z + sudo modprobe nvidia
2025-05-07T20:23:00.2451983Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:00.2452317Z + lspci
2025-05-07T20:23:00.2452540Z After installing NVIDIA driver
2025-05-07T20:23:00.2572619Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:00.2573133Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:00.2573698Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:00.2574422Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:00.2575101Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:00.2575636Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:00.2576153Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:00.2576633Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:00.2577054Z + lsmod
2025-05-07T20:23:00.2604848Z Module                  Size  Used by
2025-05-07T20:23:00.2605163Z nvidia_uvm           1884160  0
2025-05-07T20:23:00.2605431Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:00.2605724Z drm                   602112  1 nvidia
2025-05-07T20:23:00.2606039Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:00.2606345Z backlight              24576  1 drm
2025-05-07T20:23:00.2606633Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:00.2606926Z xt_conntrack           16384  1
2025-05-07T20:23:00.2607191Z nft_chain_nat          16384  3
2025-05-07T20:23:00.2607447Z xt_MASQUERADE          20480  1
2025-05-07T20:23:00.2607751Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:00.2608093Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:00.2608491Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:00.2608930Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:00.2609252Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:00.2609546Z xfrm_user              57344  1
2025-05-07T20:23:00.2609815Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:00.2610106Z xt_addrtype            16384  2
2025-05-07T20:23:00.2610362Z nft_compat             20480  4
2025-05-07T20:23:00.2610674Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:00.2611082Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:00.2611455Z br_netfilter           36864  0
2025-05-07T20:23:00.2611729Z bridge                323584  1 br_netfilter
2025-05-07T20:23:00.2612024Z stp                    16384  1 bridge
2025-05-07T20:23:00.2612300Z llc                    16384  2 bridge,stp
2025-05-07T20:23:00.2612585Z overlay               167936  0
2025-05-07T20:23:00.2612835Z tls                   135168  0
2025-05-07T20:23:00.2613079Z nls_ascii              16384  1
2025-05-07T20:23:00.2613655Z nls_cp437              20480  1
2025-05-07T20:23:00.2613907Z vfat                   24576  1
2025-05-07T20:23:00.2614153Z fat                    86016  1 vfat
2025-05-07T20:23:00.2614422Z sunrpc                696320  1
2025-05-07T20:23:00.2614668Z i8042                  45056  0
2025-05-07T20:23:00.2614927Z serio                  28672  3 i8042
2025-05-07T20:23:00.2615196Z ena                   180224  0
2025-05-07T20:23:00.2615450Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:00.2615704Z button                 24576  0
2025-05-07T20:23:00.2615949Z sch_fq_codel           20480  17
2025-05-07T20:23:00.2616210Z dm_mod                188416  0
2025-05-07T20:23:00.2616457Z fuse                  163840  1
2025-05-07T20:23:00.2616700Z loop                   36864  0
2025-05-07T20:23:00.2617105Z dax                    45056  1 dm_mod
2025-05-07T20:23:00.2617379Z configfs               57344  1
2025-05-07T20:23:00.2617625Z dmi_sysfs              20480  0
2025-05-07T20:23:00.2617880Z crc32_pclmul           16384  0
2025-05-07T20:23:00.2618275Z crc32c_intel           24576  0
2025-05-07T20:23:00.2618551Z efivarfs               24576  1
2025-05-07T20:23:00.2618801Z + modinfo nvidia
2025-05-07T20:23:00.2623644Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:00.2624110Z import_ns:      DMA_BUF
2025-05-07T20:23:00.2624362Z alias:          char-major-195-*
2025-05-07T20:23:00.2624637Z version:        570.133.07
2025-05-07T20:23:00.2624888Z supported:      external
2025-05-07T20:23:00.2625135Z license:        Dual MIT/GPL
2025-05-07T20:23:00.2625424Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:00.2625764Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:00.2626076Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:00.2626405Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:00.2626748Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:00.2627080Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:00.2627388Z depends:        i2c-core,drm
2025-05-07T20:23:00.2627647Z retpoline:      Y
2025-05-07T20:23:00.2627876Z name:           nvidia
2025-05-07T20:23:00.2628230Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:00.2628698Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:00.2629143Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:00.2629550Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:00.2629862Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:00.2630161Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:00.2630475Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:00.2630773Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:00.2631085Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:00.2631449Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:00.2631832Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:00.2632169Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:00.2632473Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:00.2632774Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:00.2633136Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:00.2633531Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:00.2633907Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:00.2634314Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2634726Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:00.2635143Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2635550Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:00.2635887Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:00.2636258Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:00.2636737Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:00.2637084Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:00.2637406Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:00.2637736Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:00.2638055Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:00.2638367Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:00.2638716Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:00.2639071Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:00.2639401Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:00.2639741Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:00.2640085Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:00.2640512Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:00.2640859Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:00.2641186Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:00.2641485Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:00.2641811Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:00.2642137Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:00.2642449Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:00.2642782Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:00.2643141Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:00.2643484Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:00.2643815Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:00.2644164Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:00.2644496Z parm:           rm_firmware_active:charp
2025-05-07T20:23:00.2644782Z + set +e
2025-05-07T20:23:00.2644982Z + nvidia-smi
2025-05-07T20:23:01.6714108Z Wed May  7 20:23:01 2025
2025-05-07T20:23:01.6714526Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6715059Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:01.6715552Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6716044Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.6716587Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:01.6717028Z |                                         |                        |               MIG M. |
2025-05-07T20:23:01.6717363Z |=========================================+========================+======================|
2025-05-07T20:23:01.6779830Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:01.6780294Z |  0%   29C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:01.6780683Z |                                         |                        |                  N/A |
2025-05-07T20:23:01.6781074Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6781470Z
2025-05-07T20:23:01.6781859Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6782284Z | Processes:                                                                              |
2025-05-07T20:23:01.6782728Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:01.6783138Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:01.6783484Z |=========================================================================================|
2025-05-07T20:23:01.6784562Z |  No running processes found                                                             |
2025-05-07T20:23:01.6785315Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:02.0987070Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.5045692Z NVIDIA A10G
2025-05-07T20:23:03.7751961Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.7752250Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.7752502Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.7752792Z + set -e
2025-05-07T20:23:03.7753014Z INFO: Ignoring allowed status 0
2025-05-07T20:23:03.7763252Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.7765213Z + sudo yum install -y yum-utils
2025-05-07T20:23:04.2289004Z Last metadata expiration check: 0:05:26 ago on Wed May  7 20:17:38 2025.
2025-05-07T20:23:04.2534948Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:04.2929478Z Dependencies resolved.
2025-05-07T20:23:04.3109820Z Nothing to do.
2025-05-07T20:23:04.3110599Z Complete!
2025-05-07T20:23:04.3482675Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:04.3483301Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.3484167Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6353867Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6909914Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:05.2057510Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:05.2304388Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:05.2704512Z Dependencies resolved.
2025-05-07T20:23:05.2884350Z ================================================================================
2025-05-07T20:23:05.2884979Z  Package                        Arch    Version   Repository               Size
2025-05-07T20:23:05.2885386Z ================================================================================
2025-05-07T20:23:05.2885701Z Downgrading:
2025-05-07T20:23:05.2886078Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:05.2886686Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:05.2887043Z
2025-05-07T20:23:05.2887136Z Transaction Summary
2025-05-07T20:23:05.2887394Z ================================================================================
2025-05-07T20:23:05.2887713Z Downgrade  2 Packages
2025-05-07T20:23:05.2887864Z
2025-05-07T20:23:05.2887976Z Total download size: 6.8 M
2025-05-07T20:23:05.2888598Z Downloading Packages:
2025-05-07T20:23:05.3605172Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x  80 MB/s | 5.6 MB     00:00
2025-05-07T20:23:05.4116222Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64  10 MB/s | 1.2 MB     00:00
2025-05-07T20:23:05.4128836Z --------------------------------------------------------------------------------
2025-05-07T20:23:05.4132191Z Total                                            55 MB/s | 6.8 MB     00:00
2025-05-07T20:23:05.4134836Z Running transaction check
2025-05-07T20:23:05.4236679Z Transaction check succeeded.
2025-05-07T20:23:05.4237117Z Running transaction test
2025-05-07T20:23:05.4530111Z Transaction test succeeded.
2025-05-07T20:23:05.4532578Z Running transaction
2025-05-07T20:23:05.9986793Z   Preparing        :                                                        1/1
2025-05-07T20:23:06.1029794Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:06.1050882Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.1251512Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.1252100Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.1350911Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.1373507Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.5374801Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:07.5375421Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:07.5375958Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:07.5376487Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:07.6781679Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.6781679Z ================================================================================
2025-05-07T20:23:07.6783191Z WARNING:
2025-05-07T20:23:07.6783488Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.6783726Z
2025-05-07T20:23:07.6783823Z   Available Versions:
2025-05-07T20:23:07.6783980Z
2025-05-07T20:23:07.6784086Z   Version 2023.7.20250331:
2025-05-07T20:23:07.6784408Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.6784663Z
2025-05-07T20:23:07.6784792Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.6785014Z
2025-05-07T20:23:07.6785102Z     Release notes:
2025-05-07T20:23:07.6785520Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.6785897Z
2025-05-07T20:23:07.6786247Z   Version 2023.7.20250414:
2025-05-07T20:23:07.6786622Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.6786973Z
2025-05-07T20:23:07.6787124Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.6787448Z
2025-05-07T20:23:07.6787580Z     Release notes:
2025-05-07T20:23:07.6788104Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.6788500Z
2025-05-07T20:23:07.6788620Z   Version 2023.7.20250428:
2025-05-07T20:23:07.6802053Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.6802321Z
2025-05-07T20:23:07.6802449Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.6802666Z
2025-05-07T20:23:07.6802756Z     Release notes:
2025-05-07T20:23:07.6803218Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.6803586Z
2025-05-07T20:23:07.6803708Z ================================================================================
2025-05-07T20:23:07.7145335Z
2025-05-07T20:23:07.7145482Z
2025-05-07T20:23:07.7145761Z Downgraded:
2025-05-07T20:23:07.7146135Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.7146713Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.7147073Z
2025-05-07T20:23:07.7147158Z Complete!
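Note that yum resolved the version-suffixed nvidia-container-toolkit-1.16.2 against the preinstalled 1.17.6 and performed a downgrade rather than a fresh install; pinning by package-name-with-version is all the script does. A sketch, with an optional versionlock (this assumes the dnf versionlock plugin is available, which this log does not show):

  sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
  # Optionally freeze the version so a later 'yum update' does not undo the downgrade:
  # sudo dnf versionlock add nvidia-container-toolkit-1.16.2-1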
2025-05-07T20:23:07.7615078Z + sudo systemctl restart docker
2025-05-07T20:23:11.7047071Z Wed May  7 20:23:11 2025
2025-05-07T20:23:11.7047500Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7048000Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:11.7048493Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.7048992Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:11.7049515Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:11.7049947Z |                                         |                        |               MIG M. |
2025-05-07T20:23:11.7050282Z |=========================================+========================+======================|
2025-05-07T20:23:11.7131700Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:11.7132642Z |  0%   29C    P0             63W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:11.7133047Z |                                         |                        |                  N/A |
2025-05-07T20:23:11.7133439Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:11.7133834Z
2025-05-07T20:23:11.7134234Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:11.7134697Z | Processes:                                                                              |
2025-05-07T20:23:11.7135139Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:11.7135710Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:11.7136048Z |=========================================================================================|
2025-05-07T20:23:11.7137560Z |  No running processes found                                                             |
2025-05-07T20:23:11.7138029Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.5995466Z Command completed after 1 attempt(s).
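The GPU_FLAG that appears in the env block of the next step was exported by the setup script through the GITHUB_ENV file, which is how one step passes environment variables to later steps in the same job. A minimal sketch (the docker invocation is a hypothetical consumer, not from this log):

  # In the setup step:
  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
  # In any later step of the same job, e.g. when launching a test container:
  docker run --rm ${GPU_FLAG} ubuntu:22.04 nvidia-smi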
2025-05-07T20:23:12.6082808Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6083298Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6097009Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.6097359Z env:
2025-05-07T20:23:12.6097599Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.6097903Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.6098261Z   BUILD_TARGET: genai
2025-05-07T20:23:12.6098510Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.6098746Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.6099014Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.6099322Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.6099652Z ##[endgroup]
2025-05-07T20:23:12.9478360Z ################################################################################
2025-05-07T20:23:12.9478839Z # Print System Info
2025-05-07T20:23:12.9479159Z #
2025-05-07T20:23:12.9494219Z # [2025-05-07T20:23:12.949Z] + print_system_info
2025-05-07T20:23:12.9494730Z ################################################################################
2025-05-07T20:23:12.9495054Z
2025-05-07T20:23:12.9495215Z ################################################################################
2025-05-07T20:23:12.9495699Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.9496115Z + printenv
2025-05-07T20:23:12.9496293Z
2025-05-07T20:23:12.9519336Z SHELL=/bin/bash
2025-05-07T20:23:12.9519726Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.9520257Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.9521031Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9521888Z GITHUB_ACTION=__run
2025-05-07T20:23:12.9522326Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.9522840Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.9523210Z RUNNER_NAME=i-06f3d8044a6f79407
2025-05-07T20:23:12.9523611Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.9524083Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.9524517Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.9525048Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.9525664Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.9525957Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.9526252Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.9526754Z ***
2025-05-07T20:23:12.9526962Z LOGNAME=ec2-user
2025-05-07T20:23:12.9527199Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.9527479Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.9527718Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.9527939Z SYSTEMD_EXEC_PID=55476
2025-05-07T20:23:12.9528223Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.9528771Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.9529277Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.9529564Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.9529828Z RUNNER_OS=Linux
2025-05-07T20:23:12.9530051Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.9530303Z HOME=/home/ec2-user
2025-05-07T20:23:12.9530557Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.9530846Z LANG=C.UTF-8
2025-05-07T20:23:12.9531152Z RUNNER_TRACKING_ID=github_b457ea54-0b6b-45b3-bdbc-45cac5aef1d8
2025-05-07T20:23:12.9531518Z RUNNER_ARCH=X64
2025-05-07T20:23:12.9531804Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.9532375Z BUILD_TARGET=genai
2025-05-07T20:23:12.9532917Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9533790Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9534531Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.9535209Z INVOCATION_ID=1482d53b51c24cadbdb69d1e5516bd3d
2025-05-07T20:23:12.9535544Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.9535814Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.9536397Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9537015Z BUILD_ENV=build_binary
2025-05-07T20:23:12.9537255Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.9537473Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.9537704Z KERN_NAME_LC=linux
2025-05-07T20:23:12.9537941Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.9538353Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.9538701Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.9538953Z USER=ec2-user
2025-05-07T20:23:12.9539191Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.9539478Z SHLVL=1
2025-05-07T20:23:12.9539684Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.9539997Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.9540445Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.9540813Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.9541061Z KERN_NAME=Linux
2025-05-07T20:23:12.9541292Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.9541707Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.9542140Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.9542460Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.9542813Z JOURNAL_STREAM=8:90275
2025-05-07T20:23:12.9543289Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.9543820Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.9544291Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.9544787Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.9545098Z CI=true
2025-05-07T20:23:12.9545393Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.9545774Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.9546160Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.9546512Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.9547401Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_9737db9c-fa2f-4aa5-8f0b-5d1fd405ba6f
2025-05-07T20:23:12.9548083Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.9548307Z _=/usr/bin/printenv
2025-05-07T20:23:12.9548452Z
2025-05-07T20:23:12.9548577Z ################################################################################
2025-05-07T20:23:12.9548897Z [INFO] Print ldd version ...
2025-05-07T20:23:12.9549166Z + ldd --version
2025-05-07T20:23:12.9549304Z
2025-05-07T20:23:12.9549403Z ldd (GNU libc) 2.34
2025-05-07T20:23:12.9549681Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:12.9550129Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:12.9550670Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:12.9551132Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:12.9551357Z
2025-05-07T20:23:12.9551482Z ################################################################################
2025-05-07T20:23:12.9551790Z [INFO] Print CPU info ...
2025-05-07T20:23:12.9552037Z + nproc 2025-05-07T20:23:12.9552154Z 2025-05-07T20:23:12.9569674Z 16 2025-05-07T20:23:12.9571771Z 2025-05-07T20:23:12.9572059Z + lscpu 2025-05-07T20:23:12.9572244Z 2025-05-07T20:23:12.9685484Z Architecture: x86_64 2025-05-07T20:23:12.9686489Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.9690698Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9691486Z Byte Order: Little Endian 2025-05-07T20:23:12.9692121Z CPU(s): 16 2025-05-07T20:23:12.9692710Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.9693351Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.9694035Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.9694484Z CPU family: 23 2025-05-07T20:23:12.9694961Z Model: 49 2025-05-07T20:23:12.9695260Z Thread(s) per core: 2 2025-05-07T20:23:12.9695553Z Core(s) per socket: 8 2025-05-07T20:23:12.9695850Z Socket(s): 1 2025-05-07T20:23:12.9696133Z Stepping: 0 2025-05-07T20:23:12.9696436Z BogoMIPS: 5599.99 2025-05-07T20:23:12.9698646Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9700763Z Hypervisor vendor: KVM 2025-05-07T20:23:12.9701076Z Virtualization type: full 2025-05-07T20:23:12.9701592Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9701962Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9702354Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.9702844Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.9703305Z NUMA node(s): 1 2025-05-07T20:23:12.9703732Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.9704207Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.9704694Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.9705081Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.9705438Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.9705896Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.9706434Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.9706811Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.9707591Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.9708410Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.9709015Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.9709708Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.9710619Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.9711369Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.9711756Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.9712084Z 2025-05-07T20:23:12.9712177Z + cat /proc/cpuinfo 2025-05-07T20:23:12.9712317Z 2025-05-07T20:23:12.9712410Z processor : 0 2025-05-07T20:23:12.9712626Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9712881Z cpu family : 23 2025-05-07T20:23:12.9713096Z model : 49 
2025-05-07T20:23:12.9713304Z model name : AMD EPYC 7R32
2025-05-07T20:23:12.9713556Z stepping : 0
2025-05-07T20:23:12.9713775Z microcode : 0x830107f
2025-05-07T20:23:12.9714126Z cpu MHz : 3305.720
2025-05-07T20:23:12.9714349Z cache size : 512 KB
2025-05-07T20:23:12.9714582Z physical id : 0
2025-05-07T20:23:12.9714791Z siblings : 16
2025-05-07T20:23:12.9714998Z core id : 0
2025-05-07T20:23:12.9715202Z cpu cores : 8
2025-05-07T20:23:12.9715403Z apicid : 0
2025-05-07T20:23:12.9715611Z initial apicid : 0
2025-05-07T20:23:12.9715830Z fpu : yes
2025-05-07T20:23:12.9716027Z fpu_exception : yes
2025-05-07T20:23:12.9716252Z cpuid level : 13
2025-05-07T20:23:12.9716463Z wp : yes
2025-05-07T20:23:12.9718591Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:12.9720887Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:12.9721378Z bogomips : 5599.99
2025-05-07T20:23:12.9721603Z TLB size : 3072 4K pages
2025-05-07T20:23:12.9721843Z clflush size : 64
2025-05-07T20:23:12.9722061Z cache_alignment : 64
2025-05-07T20:23:12.9722333Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:12.9722657Z power management:
2025-05-07T20:23:12.9722794Z

[entries for processor 1 through processor 15 are identical to processor 0 apart from core id, apicid, initial apicid, and the sampled cpu MHz]

2025-05-07T20:23:12.9893298Z ################################################################################
2025-05-07T20:23:12.9893607Z [INFO] Print PCI info ...
2025-05-07T20:23:12.9893848Z + lspci -v
2025-05-07T20:23:12.9893969Z
2025-05-07T20:23:12.9894183Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:12.9894617Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:12.9894952Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:12.9895164Z 2025-05-07T20:23:12.9895368Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:12.9895758Z Physical Slot: 1 2025-05-07T20:23:12.9896007Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9896212Z 2025-05-07T20:23:12.9896468Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:12.9896899Z Physical Slot: 1 2025-05-07T20:23:12.9897163Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:12.9897390Z 2025-05-07T20:23:12.9897664Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:12.9898202Z Physical Slot: 3 2025-05-07T20:23:12.9898451Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9898798Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.9899160Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:12.9899385Z 2025-05-07T20:23:12.9899691Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.9900297Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:12.9900585Z Physical Slot: 4 2025-05-07T20:23:12.9900840Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:12.9901224Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9901583Z Capabilities: 2025-05-07T20:23:12.9901844Z Kernel driver in use: nvme 2025-05-07T20:23:12.9902011Z 2025-05-07T20:23:12.9902503Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.9902988Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:12.9903334Z Physical Slot: 5 2025-05-07T20:23:12.9903577Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9903935Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9904334Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:12.9912360Z Capabilities: 2025-05-07T20:23:12.9912655Z Kernel driver in use: ena 2025-05-07T20:23:12.9912896Z Kernel modules: ena 2025-05-07T20:23:12.9913045Z 2025-05-07T20:23:12.9913219Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:12.9913606Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:12.9913906Z Physical Slot: 30 2025-05-07T20:23:12.9914295Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:12.9914750Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:12.9915154Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:12.9915607Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:12.9915947Z Capabilities: 2025-05-07T20:23:12.9916217Z Kernel driver in use: nvidia 2025-05-07T20:23:12.9916469Z Kernel modules: nvidia 2025-05-07T20:23:12.9916626Z 2025-05-07T20:23:12.9916935Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:12.9917462Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:12.9917756Z Physical Slot: 31 2025-05-07T20:23:12.9917996Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:12.9918359Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:12.9918755Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:12.9919081Z Capabilities: 2025-05-07T20:23:12.9919349Z Kernel driver in use: nvme 2025-05-07T20:23:12.9919514Z 2025-05-07T20:23:12.9919518Z 2025-05-07T20:23:12.9919643Z ################################################################################ 2025-05-07T20:23:12.9919973Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:12.9920257Z + uname -a 2025-05-07T20:23:12.9920377Z 2025-05-07T20:23:12.9920783Z Linux ip-10-0-69-200.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:12.9921280Z 2025-05-07T20:23:12.9921361Z + uname -m 2025-05-07T20:23:12.9921478Z 2025-05-07T20:23:12.9921550Z x86_64 2025-05-07T20:23:12.9921659Z 2025-05-07T20:23:12.9921741Z + cat /proc/version 2025-05-07T20:23:12.9921881Z 2025-05-07T20:23:12.9922423Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:12.9923052Z 2025-05-07T20:23:12.9923151Z + cat /etc/os-release 2025-05-07T20:23:12.9923295Z 2025-05-07T20:23:12.9923384Z NAME="Amazon Linux" 2025-05-07T20:23:12.9923602Z VERSION="2023" 2025-05-07T20:23:12.9923800Z ID="amzn" 2025-05-07T20:23:12.9923984Z ID_LIKE="fedora" 2025-05-07T20:23:12.9924194Z VERSION_ID="2023" 2025-05-07T20:23:12.9924435Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:12.9924775Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:12.9925056Z ANSI_COLOR="0;33" 2025-05-07T20:23:12.9925304Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:12.9925832Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:12.9926266Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:12.9926682Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:12.9927123Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:12.9927490Z VENDOR_NAME="AWS" 2025-05-07T20:23:12.9927734Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:12.9928020Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:12.9928175Z 2025-05-07T20:23:12.9928414Z ################################################################################ 2025-05-07T20:23:12.9928716Z # Print EC2 Instance Info 2025-05-07T20:23:12.9928956Z # 2025-05-07T20:23:12.9929177Z # [2025-05-07T20:23:12.990Z] + print_ec2_info 2025-05-07T20:23:12.9929488Z ################################################################################ 2025-05-07T20:23:12.9929705Z 2025-05-07T20:23:13.0028558Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:13.0146803Z instance-id: i-06f3d8044a6f79407 2025-05-07T20:23:13.0254352Z instance-type: g5.4xlarge 2025-05-07T20:23:13.0294069Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:13.0294431Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:13.0303775Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:13.0304134Z env: 2025-05-07T20:23:13.0304357Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:13.0304673Z BUILD_ENV: build_binary 2025-05-07T20:23:13.0304926Z BUILD_TARGET: genai 2025-05-07T20:23:13.0305155Z BUILD_VARIANT: cuda 2025-05-07T20:23:13.0305444Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:13.0305710Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:13.0306014Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:13.0306352Z ##[endgroup] 2025-05-07T20:23:13.3657103Z ################################################################################ 2025-05-07T20:23:13.3657484Z [INFO] Printing general display info ... 2025-05-07T20:23:13.3688660Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.4789268Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.4798043Z /usr/bin/sudo 2025-05-07T20:23:13.4808680Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.4819368Z /usr/bin/yum 2025-05-07T20:23:13.4821050Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.4840894Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.9006286Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:13.9740443Z ================================================================================ 2025-05-07T20:23:13.9741210Z WARNING: 2025-05-07T20:23:13.9741742Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:13.9742273Z 2025-05-07T20:23:13.9742454Z Available Versions: 2025-05-07T20:23:13.9742760Z 2025-05-07T20:23:13.9742933Z Version 2023.7.20250331: 2025-05-07T20:23:13.9743553Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:13.9744081Z 2025-05-07T20:23:13.9744345Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:13.9744616Z 2025-05-07T20:23:13.9744701Z Release notes: 2025-05-07T20:23:13.9745111Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:13.9745483Z 2025-05-07T20:23:13.9745582Z Version 2023.7.20250414: 2025-05-07T20:23:13.9745890Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:13.9746146Z 2025-05-07T20:23:13.9746263Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:13.9746479Z 2025-05-07T20:23:13.9746564Z Release notes: 2025-05-07T20:23:13.9746964Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:13.9747327Z 2025-05-07T20:23:13.9747427Z Version 2023.7.20250428: 2025-05-07T20:23:13.9747739Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:13.9748213Z 2025-05-07T20:23:13.9748330Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:13.9748543Z 2025-05-07T20:23:13.9748634Z Release notes: 2025-05-07T20:23:13.9749024Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:13.9749393Z 2025-05-07T20:23:13.9749502Z ================================================================================ 2025-05-07T20:23:14.0895948Z Dependencies resolved. 
2025-05-07T20:23:14.1184684Z ================================================================================
2025-05-07T20:23:14.1185094Z Package Arch Version Repository Size
2025-05-07T20:23:14.1185495Z ================================================================================
2025-05-07T20:23:14.1185804Z Upgrading:
2025-05-07T20:23:14.1186172Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:14.1186761Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M
2025-05-07T20:23:14.1187133Z
2025-05-07T20:23:14.1187435Z Transaction Summary
2025-05-07T20:23:14.1187694Z ================================================================================
2025-05-07T20:23:14.1187996Z Upgrade 2 Packages
2025-05-07T20:23:14.1188144Z
2025-05-07T20:23:14.1188276Z Total download size: 6.9 M
2025-05-07T20:23:14.1189917Z Downloading Packages:
2025-05-07T20:23:14.1605779Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 31 MB/s | 1.2 MB 00:00
2025-05-07T20:23:14.3385371Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 26 MB/s | 5.7 MB 00:00
2025-05-07T20:23:14.3395266Z --------------------------------------------------------------------------------
2025-05-07T20:23:14.3396394Z Total 31 MB/s | 6.9 MB 00:00
2025-05-07T20:23:14.3398859Z Running transaction check
2025-05-07T20:23:14.3493974Z Transaction check succeeded.
2025-05-07T20:23:14.3494285Z Running transaction test
2025-05-07T20:23:14.3788176Z Transaction test succeeded.
2025-05-07T20:23:14.3790920Z Running transaction
2025-05-07T20:23:14.9307673Z Preparing : 1/1
2025-05-07T20:23:15.0363690Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4
2025-05-07T20:23:15.0390226Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:15.0587005Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:15.0587688Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:15.0698618Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:15.0723482Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:15.2174719Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4
2025-05-07T20:23:15.2175318Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4
2025-05-07T20:23:15.2175954Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:15.2176491Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:15.4193169Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:15.4193585Z
2025-05-07T20:23:15.4193671Z Upgraded:
2025-05-07T20:23:15.4194057Z nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:15.4194725Z nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:15.4195127Z
2025-05-07T20:23:15.4195209Z Complete!
2025-05-07T20:23:15.4668737Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:15.4691703Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:15.8951794Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:05 2025.
2025-05-07T20:23:15.9190453Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:15.9597901Z Dependencies resolved.
2025-05-07T20:23:15.9775832Z ================================================================================
2025-05-07T20:23:15.9776291Z Package Architecture Version Repository Size
2025-05-07T20:23:15.9776715Z ================================================================================
2025-05-07T20:23:15.9777006Z Installing:
2025-05-07T20:23:15.9777303Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k
2025-05-07T20:23:15.9777572Z
2025-05-07T20:23:15.9777670Z Transaction Summary
2025-05-07T20:23:15.9777915Z ================================================================================
2025-05-07T20:23:15.9778324Z Install 1 Package
2025-05-07T20:23:15.9778469Z
2025-05-07T20:23:15.9778596Z Total download size: 319 k
2025-05-07T20:23:15.9779444Z Installed size: 837 k
2025-05-07T20:23:15.9781180Z Downloading Packages:
2025-05-07T20:23:16.0501875Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.4 MB/s | 319 kB 00:00
2025-05-07T20:23:16.0507492Z --------------------------------------------------------------------------------
2025-05-07T20:23:16.0510476Z Total 4.3 MB/s | 319 kB 00:00
2025-05-07T20:23:16.0664964Z Running transaction check
2025-05-07T20:23:16.0719809Z Transaction check succeeded.
2025-05-07T20:23:16.0720752Z Running transaction test
2025-05-07T20:23:16.1184426Z Transaction test succeeded.
2025-05-07T20:23:16.1187698Z Running transaction
2025-05-07T20:23:16.2238166Z Preparing : 1/1
2025-05-07T20:23:16.2769460Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.4931286Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.6490885Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:16.6491226Z
2025-05-07T20:23:16.6491315Z Installed:
2025-05-07T20:23:16.6491635Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:16.6491931Z
2025-05-07T20:23:16.6492022Z Complete!
2025-05-07T20:23:16.6940252Z + hostname
2025-05-07T20:23:16.6940421Z
2025-05-07T20:23:16.6954402Z ip-10-0-69-200.ec2.internal
2025-05-07T20:23:16.6956341Z
2025-05-07T20:23:16.6956925Z + sudo lshw -C display
2025-05-07T20:23:16.6957095Z
2025-05-07T20:23:17.1156919Z *-display:0 UNCLAIMED
2025-05-07T20:23:17.1157408Z description: VGA compatible controller
2025-05-07T20:23:17.1157903Z product: Amazon.com, Inc.
2025-05-07T20:23:17.1158335Z vendor: Amazon.com, Inc.
2025-05-07T20:23:17.1158742Z physical id: 3 2025-05-07T20:23:17.1159110Z bus info: pci@0000:00:03.0 2025-05-07T20:23:17.1159518Z version: 00 2025-05-07T20:23:17.1159849Z width: 32 bits 2025-05-07T20:23:17.1160189Z clock: 33MHz 2025-05-07T20:23:17.1160582Z capabilities: vga_controller bus_master 2025-05-07T20:23:17.1161084Z configuration: latency=0 2025-05-07T20:23:17.1161597Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:17.1162140Z *-display:1 2025-05-07T20:23:17.1162521Z description: 3D controller 2025-05-07T20:23:17.1162974Z product: GA102GL [A10G] 2025-05-07T20:23:17.1163393Z vendor: NVIDIA Corporation 2025-05-07T20:23:17.1163792Z physical id: 1e 2025-05-07T20:23:17.1164127Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:17.1164481Z version: a1 2025-05-07T20:23:17.1164787Z width: 64 bits 2025-05-07T20:23:17.1165105Z clock: 33MHz 2025-05-07T20:23:17.1165526Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:17.1166120Z configuration: driver=nvidia latency=0 2025-05-07T20:23:17.1167076Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:17.1200236Z 2025-05-07T20:23:17.1200658Z ################################################################################ 2025-05-07T20:23:17.1200994Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:17.1329692Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:17.1498963Z Wed May 7 20:23:17 2025 2025-05-07T20:23:17.1499487Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.1500013Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:17.1500501Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.1500989Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:17.1501508Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:17.1501935Z | | | MIG M. | 2025-05-07T20:23:17.1502263Z |=========================================+========================+======================| 2025-05-07T20:23:17.1583548Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:17.1584231Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:17.1584613Z | | | N/A | 2025-05-07T20:23:17.1584998Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.1585390Z 2025-05-07T20:23:17.1585768Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.1586192Z | Processes: | 2025-05-07T20:23:17.1586628Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:17.1587039Z | ID ID Usage | 2025-05-07T20:23:17.1587384Z |=========================================================================================| 2025-05-07T20:23:17.1588487Z | No running processes found | 2025-05-07T20:23:17.1588951Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.2973066Z ################################################################################ 2025-05-07T20:23:17.2973446Z [INFO] Printing AMD GPU info ... 
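Aside: the GPU_FLAG value exported in each step's env block above ("--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all") is presumably spliced into later docker run invocations by the CI scripts. A sketch of how such a flag is typically consumed (the container image tag is illustrative, not taken from this log):

  GPU_FLAG="--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all"
  # Left unquoted on purpose so it word-splits into separate docker arguments.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi
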
2025-05-07T20:23:17.3113504Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:17.3114394Z [CHECK] rocminfo not found 2025-05-07T20:23:17.3123395Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:17.3124493Z [CHECK] rocm-smi not found 2025-05-07T20:23:17.3187512Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:17.3187953Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:17.3200734Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:17.3201093Z env: 2025-05-07T20:23:17.3201333Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:17.3201647Z BUILD_ENV: build_binary 2025-05-07T20:23:17.3201901Z BUILD_TARGET: genai 2025-05-07T20:23:17.3202137Z BUILD_VARIANT: cuda 2025-05-07T20:23:17.3202382Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:17.3202648Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:17.3202951Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:17.3203292Z ##[endgroup] 2025-05-07T20:23:17.6545927Z ################################################################################ 2025-05-07T20:23:17.6546287Z # Setup Miniconda 2025-05-07T20:23:17.6546507Z # 2025-05-07T20:23:17.6561209Z # [2025-05-07T20:23:17.655Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:17.6561624Z ################################################################################ 2025-05-07T20:23:17.6561841Z 2025-05-07T20:23:17.6577336Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:17.7461464Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:17.7461828Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:17.7462032Z 2025-05-07T20:23:17.7478265Z 2025-05-07T20:23:17.7478581Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:17.7499527Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:18.5997697Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:18.5998125Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:18.5998390Z 2025-05-07T20:23:18.6142319Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:19.0664661Z Unpacking payload ... 2025-05-07T20:23:19.5850654Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:20.3832730Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:22.4856892Z 2025-05-07T20:23:22.4857593Z Installing base environment... 2025-05-07T20:23:22.4857905Z 2025-05-07T20:23:23.5659360Z Preparing transaction: ...working... done 2025-05-07T20:23:26.5532420Z Executing transaction: ...working... done 2025-05-07T20:23:27.2100444Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:27.2984074Z installation finished. 2025-05-07T20:23:27.2992504Z 2025-05-07T20:23:27.2992729Z + rm -f miniconda.sh 2025-05-07T20:23:27.2992915Z 2025-05-07T20:23:27.3294869Z 2025-05-07T20:23:27.3295237Z [SETUP] Reloading the bash configuration ... 
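Editor's note: the Miniconda bootstrap just logged is fully scripted; a condensed replay of the same sequence (download, batch install, shell init), using the exact commands from the log with the prefix as a placeholder, before the conda init output below:

    # Non-interactive Miniconda install, as performed by setup_miniconda.
    PREFIX="$HOME/miniconda"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u   # -b: batch mode (no prompts), -u: allow updating an existing prefix
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash          # writes activation hooks into ~/.bashrc
    . ~/.bashrc                            # reload so `conda` resolves in the current shell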
2025-05-07T20:23:27.3295732Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.6946195Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.6946746Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.6947255Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.6947628Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.6947997Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.6948392Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.6948828Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.6949274Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.6949736Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.6950516Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.6951047Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.6951426Z modified /home/ec2-user/.bashrc
2025-05-07T20:23:27.6951817Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.7598067Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:28.5881944Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.5906895Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:41.9849225Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:23:43.5317991Z Solving environment: done
2025-05-07T20:23:43.6288826Z ## Package Plan ##
2025-05-07T20:23:43.6290921Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.6291291Z added / updated specs:
2025-05-07T20:23:43.6291561Z - conda-libmamba-solver
2025-05-07T20:23:43.6291821Z - libarchive
2025-05-07T20:23:43.6292038Z - libmamba
2025-05-07T20:23:43.6292289Z - libmambapy
2025-05-07T20:23:43.6292565Z The following packages will be downloaded:
2025-05-07T20:23:43.6292898Z package | build
2025-05-07T20:23:43.6293221Z ---------------------------|-----------------
2025-05-07T20:23:43.6293640Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:23:43.6294117Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:23:43.6294542Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:23:43.6295023Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:23:43.6295473Z ------------------------------------------------------------
2025-05-07T20:23:43.6295809Z Total: 1.4 MB
2025-05-07T20:23:43.6296139Z The following packages will be UPDATED:
2025-05-07T20:23:43.6300600Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.6301395Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.6302003Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.6302700Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.6303510Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.6304155Z Downloading and Extracting Packages: ...working... done (all four downloads reached 100%; per-package progress bars elided)
2025-05-07T20:23:43.9459730Z Preparing transaction: done
2025-05-07T20:23:44.0464850Z Verifying transaction: done
2025-05-07T20:23:45.3483436Z Executing transaction: done
2025-05-07T20:23:47.0518528Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:47.0542901Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:47.9829647Z Channels:
2025-05-07T20:23:47.9829984Z - defaults
2025-05-07T20:23:47.9830288Z Platform: linux-64
2025-05-07T20:23:49.1893077Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.3061892Z Solving environment: done
2025-05-07T20:23:49.3062354Z Channels:
2025-05-07T20:23:49.3062663Z - defaults
2025-05-07T20:23:49.3062663Z Platform: linux-64
2025-05-07T20:23:49.5995428Z Collecting package metadata (repodata.json): done
2025-05-07T20:23:49.8151296Z Solving environment: done
2025-05-07T20:23:49.9640147Z ## Package Plan ##
2025-05-07T20:23:49.9640772Z environment location: /home/ec2-user/miniconda
2025-05-07T20:23:49.9641456Z added / updated specs:
2025-05-07T20:23:49.9641947Z - conda
2025-05-07T20:23:49.9642452Z The following packages will be downloaded:
2025-05-07T20:23:49.9643135Z package | build
2025-05-07T20:23:49.9643730Z ---------------------------|-----------------
2025-05-07T20:23:49.9644132Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:23:49.9644808Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:23:49.9645197Z ------------------------------------------------------------
2025-05-07T20:23:49.9645539Z Total: 1.4 MB
2025-05-07T20:23:49.9645871Z The following packages will be UPDATED:
2025-05-07T20:23:49.9646525Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:49.9647051Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:49.9647454Z Downloading and Extracting Packages: ...working... done (both downloads reached 100%; progress bars elided)
2025-05-07T20:23:50.3248732Z Preparing transaction: done
2025-05-07T20:23:50.4255020Z Verifying transaction: done
2025-05-07T20:23:52.4334244Z Executing transaction: done
2025-05-07T20:23:53.0639557Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.0643496Z + conda clean --packages --tarball -y
2025-05-07T20:23:54.0684327Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.0684677Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.1328465Z + conda clean --all -y
2025-05-07T20:23:54.6709792Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.6710156Z Will remove 1 index cache(s).
2025-05-07T20:23:54.6710451Z There are no unused package(s) to remove.
2025-05-07T20:23:54.6710769Z There are no tempfile(s) to remove. 2025-05-07T20:23:54.6711061Z There are no logfile(s) to remove. 2025-05-07T20:23:54.7327498Z 2025-05-07T20:23:54.7332780Z + conda info 2025-05-07T20:23:54.7332950Z 2025-05-07T20:23:55.5017812Z 2025-05-07T20:23:55.5018565Z active environment : base 2025-05-07T20:23:55.5019082Z active env location : /home/ec2-user/miniconda 2025-05-07T20:23:55.5019545Z shell level : 1 2025-05-07T20:23:55.5019837Z user config file : /home/ec2-user/.condarc 2025-05-07T20:23:55.5020236Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:23:55.5020616Z conda version : 25.3.1 2025-05-07T20:23:55.5020892Z conda-build version : not installed 2025-05-07T20:23:55.5021196Z python version : 3.13.2.final.0 2025-05-07T20:23:55.5021497Z solver : libmamba (default) 2025-05-07T20:23:55.5021801Z virtual packages : __archspec=1=zen2 2025-05-07T20:23:55.5022100Z __conda=25.3.1=0 2025-05-07T20:23:55.5022377Z __cuda=12.8=0 2025-05-07T20:23:55.5022652Z __glibc=2.34=0 2025-05-07T20:23:55.5022923Z __linux=6.1.130=0 2025-05-07T20:23:55.5023197Z __unix=0=0 2025-05-07T20:23:55.5023876Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:23:55.5024293Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:23:55.5024642Z conda av metadata url : None 2025-05-07T20:23:55.5025012Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:23:55.5025440Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:23:55.5025828Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:23:55.5026203Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:23:55.5026581Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:23:55.5026916Z /home/ec2-user/.conda/pkgs 2025-05-07T20:23:55.5027256Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:23:55.5027593Z /home/ec2-user/.conda/envs 2025-05-07T20:23:55.5027889Z platform : linux-64 2025-05-07T20:23:55.5028731Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:23:55.5029701Z UID:GID : 1000:1000 2025-05-07T20:23:55.5029972Z netrc file : None 2025-05-07T20:23:55.5030227Z offline mode : False 2025-05-07T20:23:55.5030401Z 2025-05-07T20:23:55.5686879Z 2025-05-07T20:23:55.5687303Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:23:55.5688026Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_f43e8320-63a0-46a8-acf5-3813a231fef4 ... 2025-05-07T20:23:55.5690271Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:23:55.5860285Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.5860788Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.10 2025-05-07T20:23:55.5878557Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:55.5878912Z env: 2025-05-07T20:23:55.5879136Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:55.5879444Z BUILD_ENV: build_binary 2025-05-07T20:23:55.5879707Z BUILD_TARGET: genai 2025-05-07T20:23:55.5879938Z BUILD_VARIANT: cuda 2025-05-07T20:23:55.5880169Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:55.5880430Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:55.5880737Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:55.5881065Z ##[endgroup] 2025-05-07T20:23:55.9259010Z ################################################################################ 2025-05-07T20:23:55.9259398Z # Create Conda Environment 2025-05-07T20:23:55.9259648Z # 2025-05-07T20:23:55.9275626Z # [2025-05-07T20:23:55.927Z] + create_conda_environment build_binary 3.10 2025-05-07T20:23:55.9276053Z ################################################################################ 2025-05-07T20:23:55.9284082Z 2025-05-07T20:23:55.9290759Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:56.0221349Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:56.0221739Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:23:56.0222085Z + conda info --envs 2025-05-07T20:23:56.0222227Z 2025-05-07T20:23:56.7673748Z 2025-05-07T20:23:56.7674235Z # conda environments: 2025-05-07T20:23:56.7674534Z # 2025-05-07T20:23:56.7674767Z base /home/ec2-user/miniconda 2025-05-07T20:23:56.7674996Z 2025-05-07T20:23:56.8327488Z 2025-05-07T20:23:56.8328129Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:23:58.4608146Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:23:58.4608434Z 2025-05-07T20:23:58.4624511Z 2025-05-07T20:23:58.4633736Z [SETUP] Creating new Conda environment (Python 3.10) ... 
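Editor's note: every network-bound command in this log is echoed as [EXEC] [ATTEMPT 0/3] + ..., including the conda create call below, which indicates that setup_env.bash routes such commands through a retry helper. A minimal sketch of such a wrapper, assuming three attempts and an illustrative fixed delay (the real implementation in .github/scripts/setup_env.bash may differ):

    # Hypothetical retry wrapper matching the [EXEC] [ATTEMPT i/3] log format.
    exec_with_retries () {
      local max_retries=3 attempt=0
      while (( attempt < max_retries )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then return 0; fi
        attempt=$((attempt + 1))
        sleep 10   # illustrative delay; the actual backoff is not visible in the log
      done
      echo "[EXEC] command failed after ${max_retries} attempts: $*" >&2
      return 1
    }
    exec_with_retries conda create -y -n build_binary python=3.10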
2025-05-07T20:23:58.4656888Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.10
2025-05-07T20:23:59.2270562Z Channels:
2025-05-07T20:23:59.2270812Z - defaults
2025-05-07T20:23:59.2271034Z Platform: linux-64
2025-05-07T20:24:00.7558276Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:00.8564017Z Solving environment: done
2025-05-07T20:24:00.8906371Z ## Package Plan ##
2025-05-07T20:24:00.8906810Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:00.8907240Z added / updated specs:
2025-05-07T20:24:00.8907495Z - python=3.10
2025-05-07T20:24:00.8907758Z The following packages will be downloaded:
2025-05-07T20:24:00.8908130Z package | build
2025-05-07T20:24:00.8908458Z ---------------------------|-----------------
2025-05-07T20:24:00.8908814Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:00.8909223Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:00.8909651Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:00.8910069Z python-3.10.16 | he870216_1 26.9 MB
2025-05-07T20:24:00.8910852Z setuptools-78.1.1 | py310h06a4308_0 1.7 MB
2025-05-07T20:24:00.8911255Z wheel-0.45.1 | py310h06a4308_0 115 KB
2025-05-07T20:24:00.8911624Z ------------------------------------------------------------
2025-05-07T20:24:00.8911957Z Total: 28.8 MB
2025-05-07T20:24:00.8912301Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:00.8912950Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:00.8913405Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:00.8913825Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:00.8914311Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:00.8914859Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:00.8915333Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:00.8915760Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.8916209Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:00.8916686Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:00.8917191Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:00.8917613Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:00.8918038Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:00.8918450Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:00.8918849Z python pkgs/main/linux-64::python-3.10.16-he870216_1
2025-05-07T20:24:00.8919280Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:00.8919755Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py310h06a4308_0
2025-05-07T20:24:00.8920232Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:00.8920614Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:00.8920998Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:00.8921420Z wheel pkgs/main/linux-64::wheel-0.45.1-py310h06a4308_0
2025-05-07T20:24:00.8921815Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:00.8922187Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:00.8922589Z Downloading and Extracting Packages: ...working... done (all six downloads reached 100%; progress bars elided)
2025-05-07T20:24:02.1514109Z Preparing transaction: done
2025-05-07T20:24:03.3220196Z Verifying transaction: done
2025-05-07T20:24:05.5441099Z Executing transaction: done
2025-05-07T20:24:05.5945170Z #
2025-05-07T20:24:05.5945509Z # To activate this environment, use
2025-05-07T20:24:05.5945810Z #
2025-05-07T20:24:05.5946073Z # $ conda activate build_binary
2025-05-07T20:24:05.5946443Z #
2025-05-07T20:24:05.5946749Z # To deactivate an active environment, use
2025-05-07T20:24:05.5947535Z #
2025-05-07T20:24:05.5947842Z # $ conda deactivate
2025-05-07T20:24:05.6985478Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.7006847Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.6473259Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (25.1)
2025-05-07T20:24:08.6473870Z Collecting pip
2025-05-07T20:24:08.6474202Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.6475024Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.6475860Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 54.3 MB/s eta 0:00:00
2025-05-07T20:24:08.6476229Z Installing collected packages: pip
2025-05-07T20:24:08.6476536Z Attempting uninstall: pip
2025-05-07T20:24:08.6476832Z Found existing installation: pip 25.1
2025-05-07T20:24:08.6477145Z Uninstalling pip-25.1:
2025-05-07T20:24:08.6477435Z Successfully uninstalled pip-25.1
2025-05-07T20:24:08.6477770Z Successfully installed pip-25.1.1
2025-05-07T20:24:08.7106865Z [SETUP] Upgrading pyOpenSSL ...
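Editor's note: two invocation patterns recur in this phase. conda run -n <env> executes a tool inside the environment without activating it in the calling shell (used for the pip upgrade above), while conda install -c conda-forge --override-channels pins the solve to a single channel (used for the pyOpenSSL step below). A sketch of both, assuming the build_binary environment from the previous step:

    # Pattern 1: run a tool inside the env without activating it.
    conda run -n build_binary pip install --upgrade pip

    # Pattern 2: install from conda-forge only, ignoring the default channels.
    # Note the quotes: unquoted, the `>` in the version spec would be parsed
    # by the shell as an output redirection.
    conda install -n build_binary -c conda-forge --override-channels -y 'pyOpenSSL>22.1.0'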
2025-05-07T20:24:08.7129588Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.5653385Z Channels:
2025-05-07T20:24:09.5653647Z - conda-forge
2025-05-07T20:24:09.5653900Z Platform: linux-64
2025-05-07T20:24:20.2413627Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.8440601Z Solving environment: done
2025-05-07T20:24:21.9042461Z ## Package Plan ##
2025-05-07T20:24:21.9042932Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.9043330Z added / updated specs:
2025-05-07T20:24:21.9043609Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.9043963Z The following packages will be downloaded:
2025-05-07T20:24:21.9044299Z package | build
2025-05-07T20:24:21.9044614Z ---------------------------|-----------------
2025-05-07T20:24:21.9044981Z cffi-1.17.1 | py310h8deb56e_0 238 KB conda-forge
2025-05-07T20:24:21.9045428Z cryptography-44.0.3 | py310h6c63255_0 1.5 MB conda-forge
2025-05-07T20:24:21.9045878Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:21.9046301Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:21.9046719Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:21.9047131Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:21.9047551Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:21.9047985Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:21.9048415Z python_abi-3.10 | 2_cp310 4 KB conda-forge
2025-05-07T20:24:21.9048865Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:21.9049359Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:21.9049786Z ------------------------------------------------------------
2025-05-07T20:24:21.9050125Z Total: 6.3 MB
2025-05-07T20:24:21.9050460Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.9050876Z cffi conda-forge/linux-64::cffi-1.17.1-py310h8deb56e_0
2025-05-07T20:24:21.9051373Z cryptography conda-forge/linux-64::cryptography-44.0.3-py310h6c63255_0
2025-05-07T20:24:21.9052263Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.9052712Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.9053191Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.9053656Z python_abi conda-forge/linux-64::python_abi-3.10-2_cp310
2025-05-07T20:24:21.9055977Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.9056885Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.9057352Z The following packages will be UPDATED:
2025-05-07T20:24:21.9057953Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.9058841Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.9059483Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.9060103Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.9060618Z Downloading and Extracting Packages: ...working... done (all eleven downloads reached 100%; progress bars elided)
2025-05-07T20:24:22.5186122Z Preparing transaction: done
2025-05-07T20:24:22.6188925Z Verifying transaction: done
2025-05-07T20:24:24.1213805Z Executing transaction: done
2025-05-07T20:24:24.2978531Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:26.0173640Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:26.0188128Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:26.0211691Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.8872281Z Channels:
2025-05-07T20:24:26.8872525Z - conda-forge
2025-05-07T20:24:26.8872758Z Platform: linux-64
2025-05-07T20:24:30.2654168Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.6314219Z Solving environment: done
2025-05-07T20:24:30.6924832Z ## Package Plan ##
2025-05-07T20:24:30.6925287Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.6925726Z added / updated specs:
2025-05-07T20:24:30.6925980Z - libxcrypt
2025-05-07T20:24:30.6926244Z The following packages will be downloaded:
2025-05-07T20:24:30.6926587Z package | build
2025-05-07T20:24:30.6926914Z ---------------------------|-----------------
2025-05-07T20:24:30.6927297Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:30.6927707Z ------------------------------------------------------------
2025-05-07T20:24:30.6928048Z Total: 98 KB
2025-05-07T20:24:30.6928390Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.6928831Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.6929282Z Downloading and Extracting Packages: ...working... done (libxcrypt-4.4.36, 98 KB, 100%)
2025-05-07T20:24:30.9592038Z Preparing transaction: done
2025-05-07T20:24:31.0596685Z Verifying transaction: done
2025-05-07T20:24:31.1601349Z Executing transaction: done
2025-05-07T20:24:34.5960424Z [SETUP] Copying over ...
2025-05-07T20:24:34.5961378Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.10/crypt.h
2025-05-07T20:24:36.2398060Z [SETUP] Installed Python version: Python 3.10.16
2025-05-07T20:24:36.2399301Z [SETUP] Successfully created Conda environment: build_binary
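Editor's note: the libxcrypt step above ends by copying crypt.h into the environment's Python 3.10 include directory. The crypt.h header was split out of glibc into the separate libxcrypt project, and this copy presumably keeps extension builds that still pull in crypt.h through the Python headers compiling. A replay of that workaround, with the paths taken from this log:

    # Make crypt.h visible to anything compiling against this env's Python 3.10 headers.
    ENV_PREFIX="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "$ENV_PREFIX/include/crypt.h" "$ENV_PREFIX/include/python3.10/crypt.h"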
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:24:36.2449895Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:36.2450429Z env: 2025-05-07T20:24:36.2450668Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:36.2450981Z BUILD_ENV: build_binary 2025-05-07T20:24:36.2451229Z BUILD_TARGET: genai 2025-05-07T20:24:36.2451464Z BUILD_VARIANT: cuda 2025-05-07T20:24:36.2451704Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:36.2451961Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:36.2452270Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:36.2452605Z ##[endgroup] 2025-05-07T20:24:36.5814048Z ################################################################################ 2025-05-07T20:24:36.5814559Z # Install C/C++ Compilers 2025-05-07T20:24:36.5814907Z # 2025-05-07T20:24:36.5831334Z # [2025-05-07T20:24:36.582Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:24:36.5831739Z ################################################################################ 2025-05-07T20:24:36.5831951Z 2025-05-07T20:24:36.5848797Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:36.6729659Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:36.6740510Z [INSTALL] Installing GLIBC (architecture = 64) ... 2025-05-07T20:24:36.6763712Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17 2025-05-07T20:24:37.5386749Z Channels: 2025-05-07T20:24:37.5387009Z - conda-forge 2025-05-07T20:24:37.5387241Z Platform: linux-64 2025-05-07T20:24:40.8251811Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:41.1931521Z Solving environment: \ done 2025-05-07T20:24:41.2548821Z 2025-05-07T20:24:41.2549035Z ## Package Plan ## 2025-05-07T20:24:41.2549260Z 2025-05-07T20:24:41.2549556Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:41.2549965Z 2025-05-07T20:24:41.2550091Z added / updated specs: 2025-05-07T20:24:41.2550434Z - sysroot_linux-64=2.17 2025-05-07T20:24:41.2550653Z 2025-05-07T20:24:41.2550658Z 2025-05-07T20:24:41.2550804Z The following packages will be downloaded: 2025-05-07T20:24:41.2551025Z 2025-05-07T20:24:41.2551146Z package | build 2025-05-07T20:24:41.2551477Z ---------------------------|----------------- 2025-05-07T20:24:41.2551904Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge 2025-05-07T20:24:41.2552404Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge 2025-05-07T20:24:41.2552965Z ------------------------------------------------------------ 2025-05-07T20:24:41.2553409Z Total: 15.4 MB 2025-05-07T20:24:41.2553626Z 2025-05-07T20:24:41.2553769Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:41.2554005Z 2025-05-07T20:24:41.2554292Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18 2025-05-07T20:24:41.2554874Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18 2025-05-07T20:24:41.2555194Z 2025-05-07T20:24:41.2555202Z 2025-05-07T20:24:41.2555212Z 2025-05-07T20:24:41.2555364Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:41.2556018Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:41.2556253Z 2025-05-07T20:24:41.4642668Z kernel-headers_linux | 921 KB | | 0%  2025-05-07T20:24:41.4650641Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:41.4650893Z 2025-05-07T20:24:41.4746672Z kernel-headers_linux | 921 KB | 1 | 2%  2025-05-07T20:24:41.4748275Z 2025-05-07T20:24:41.5643821Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.6324935Z sysroot_linux-64-2.1 | 14.5 MB | ########9 | 90% 2025-05-07T20:24:41.7271404Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.7271668Z 2025-05-07T20:24:41.7272613Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.7272872Z 2025-05-07T20:24:42.2048173Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:42.2051337Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:42.2052095Z 2025-05-07T20:24:42.2052527Z 2025-05-07T20:24:42.2052961Z  done 2025-05-07T20:24:42.3055368Z Preparing transaction: / done 2025-05-07T20:24:42.5066649Z Verifying transaction: \ | done 2025-05-07T20:24:42.7109631Z Executing transaction: - \ done 2025-05-07T20:24:42.8638043Z [CHECK] LD_LIBRARY_PATH = 2025-05-07T20:24:42.8638376Z [CHECK] CONDA_PREFIX is not set. 2025-05-07T20:24:44.5443291Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 2025-05-07T20:24:44.5459186Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ... 2025-05-07T20:24:44.5481817Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:45.4450373Z Channels: 2025-05-07T20:24:45.4450718Z - conda-forge 2025-05-07T20:24:45.4451068Z Platform: linux-64 2025-05-07T20:24:48.6872572Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:49.6410113Z Solving environment: \ | / done 2025-05-07T20:24:49.7045632Z 2025-05-07T20:24:49.7046008Z ## Package Plan ## 2025-05-07T20:24:49.7046264Z 2025-05-07T20:24:49.7046525Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.7046852Z 2025-05-07T20:24:49.7046951Z added / updated specs: 2025-05-07T20:24:49.7047218Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.7047381Z 2025-05-07T20:24:49.7047384Z 2025-05-07T20:24:49.7047519Z The following packages will be downloaded: 2025-05-07T20:24:49.7047767Z 2025-05-07T20:24:49.7047886Z package | build 2025-05-07T20:24:49.7048211Z ---------------------------|----------------- 2025-05-07T20:24:49.7048628Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.7049132Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.7049601Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.7050049Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.7050498Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.7050935Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.7051372Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.7051849Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.7052324Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.7052768Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.7053241Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.7053728Z libstdcxx-ng-15.1.0 | h4852527_2 
34 KB conda-forge 2025-05-07T20:24:49.7054125Z ------------------------------------------------------------ 2025-05-07T20:24:49.7054466Z Total: 91.6 MB 2025-05-07T20:24:49.7054683Z 2025-05-07T20:24:49.7054814Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.7055040Z 2025-05-07T20:24:49.7055319Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.7056137Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.7057116Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.7057649Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.7058289Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.7058949Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.7059525Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.7060101Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.7060603Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.7061144Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.7061512Z 2025-05-07T20:24:49.7061631Z The following packages will be UPDATED: 2025-05-07T20:24:49.7061838Z 2025-05-07T20:24:49.7062166Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.7062888Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.7063295Z 2025-05-07T20:24:49.7063305Z 2025-05-07T20:24:49.7063309Z 2025-05-07T20:24:49.7063455Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:49.7063455Z Downloading and Extracting Packages: ...working...
[progress-bar frames elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), binutils_linux-64 (28 KB) all downloaded to 100%]
2025-05-07T20:24:52.0868563Z done
2025-05-07T20:24:52.1878642Z Preparing transaction: done
2025-05-07T20:24:52.4889613Z Verifying transaction: done
2025-05-07T20:24:52.5905272Z Executing transaction: done
2025-05-07T20:24:52.7546388Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:56.6265952Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:56.6297935Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:56.6328433Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:56.6357952Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:58.5182244Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:58.5813910Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.4598630Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:00.5232242Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:02.3970071Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:02.4588754Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:04.3356728Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:04.3974980Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:04.3979742Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:04.3980171Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:04.3980380Z 2025-05-07T20:25:06.2822143Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2822482Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:06.2822777Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:06.2823050Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2823443Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:06.2826082Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:06.2826533Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:06.2826907Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:06.2827173Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:06.2827430Z #define __CHAR_BIT__ 8 2025-05-07T20:25:06.2827849Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:06.2828099Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:06.2828361Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:06.2828641Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:06.2828917Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:06.2829228Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2829535Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:06.2829831Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:06.2830158Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:06.2830491Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:06.2830903Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:06.2831323Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:06.2831644Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:06.2831932Z #define __GCC_IEC_559 2 2025-05-07T20:25:06.2832182Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2832468Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:06.2832739Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2833020Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:06.2833385Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2833738Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:06.2834017Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2834294Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:06.2834564Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:06.2834835Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:06.2835095Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:06.2835363Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:06.2835634Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:06.2835956Z #define __INT8_C(c) c 2025-05-07T20:25:06.2836203Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:06.2836508Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2836829Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:06.2837154Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.2837519Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2837803Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.2838070Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2838356Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:06.2838644Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:06.2839038Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:06.2839461Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:06.2839756Z #define __linux 1 2025-05-07T20:25:06.2839992Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:06.2840280Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:06.2840569Z #define __unix 1 2025-05-07T20:25:06.2840797Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:06.2841083Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2841356Z #define __WINT_MIN__ 0U 2025-05-07T20:25:06.2841601Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2841893Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:06.2842171Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:06.2842441Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:06.2842691Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:06.2842986Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:06.2843315Z #define __INT64_C(c) c ## L 2025-05-07T20:25:06.2843606Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:06.2843910Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:06.2844182Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:06.2844534Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:06.2844910Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:06.2845264Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:06.2845526Z #define __DBL_DIG__ 15 2025-05-07T20:25:06.2845759Z #define __FLT32_DIG__ 6 2025-05-07T20:25:06.2846065Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:06.2846416Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:06.2846741Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:06.2847069Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:06.2847415Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:06.2847664Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:06.2847929Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:06.2848309Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:06.2848706Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:06.2848983Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:06.2849242Z #define __unix__ 1 2025-05-07T20:25:06.2849463Z #define __INT_WIDTH__ 32 2025-05-07T20:25:06.2849717Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:06.2849975Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:06.2850226Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:06.2850498Z #define __UINT16_C(c) c 2025-05-07T20:25:06.2850742Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:06.2851003Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:06.2851364Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:06.2851736Z #define __gnu_linux__ 1 2025-05-07T20:25:06.2851986Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:06.2852263Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2852557Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2852835Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:06.2853111Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:06.2853409Z #define __GNUC__ 11 2025-05-07T20:25:06.2853630Z #define __pie__ 2 2025-05-07T20:25:06.2862727Z #define __MMX__ 1 2025-05-07T20:25:06.2862968Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:06.2863239Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:06.2863573Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:06.2863850Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:06.2864201Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.2864603Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2864920Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.2865189Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:06.2865460Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:06.2865749Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:06.2866004Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:06.2866257Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:06.2866539Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:06.2866829Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:06.2867102Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:06.2867383Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:06.2867630Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:06.2867893Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:06.2868162Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:06.2868418Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:06.2868674Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:06.2868993Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.2869349Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:06.2869622Z #define __SSE2_MATH__ 1 2025-05-07T20:25:06.2869867Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:06.2870164Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2870459Z #define __amd64 1 2025-05-07T20:25:06.2870688Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:06.2870952Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:06.2871260Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:06.2871573Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:06.2871834Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2872305Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:06.2872564Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:06.2873204Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:06.2873468Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:06.2873734Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:06.2873998Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:06.2874271Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:06.2874650Z #define __x86_64 1 2025-05-07T20:25:06.2874885Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:06.2875250Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:06.2875707Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:06.2876156Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:06.2876621Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.2876998Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:06.2877249Z #define __LP64__ 1 2025-05-07T20:25:06.2877479Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2877829Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:06.2878208Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:06.2878481Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2878751Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2879040Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:06.2879315Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:06.2879577Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:06.2879839Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:06.2880099Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:06.2880361Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:06.2880684Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:06.2881041Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:06.2881318Z #define __FLT_DIG__ 6 2025-05-07T20:25:06.2881543Z #define __NO_INLINE__ 1 2025-05-07T20:25:06.2881787Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:06.2882113Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:06.2882456Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:06.2882716Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:06.2882980Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:06.2883254Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:06.2883542Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:06.2883804Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:06.2884095Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:06.2884379Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:06.2884645Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:06.2884950Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.2885273Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:06.2885537Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:06.2885794Z #define __FLT128_DIG__ 33 2025-05-07T20:25:06.2886026Z #define __INT32_C(c) c 2025-05-07T20:25:06.2886268Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:06.2886549Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:06.2886824Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:06.2887103Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:06.2887419Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:06.2887716Z #define unix 1 2025-05-07T20:25:06.2887948Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:06.2888264Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2888560Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:06.2888871Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:06.2889200Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:06.2889451Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:06.2889707Z #define __ELF__ 1 2025-05-07T20:25:06.2889940Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:06.2890223Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:06.2890490Z #define __FLT_RADIX__ 2 2025-05-07T20:25:06.2890742Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:06.2891104Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:06.2891567Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:06.2891826Z #define __SSE_MATH__ 1 2025-05-07T20:25:06.2892051Z #define __k8 1 2025-05-07T20:25:06.2892341Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:06.2892828Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:06.2893124Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:06.2893445Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:06.2893726Z #define __LDBL_DIG__ 18 2025-05-07T20:25:06.2893968Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:06.2894222Z #define __x86_64__ 1 2025-05-07T20:25:06.2894452Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:06.2894749Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:06.2895083Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2895381Z #define __FLT64_DIG__ 15 2025-05-07T20:25:06.2895670Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2896030Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.2896341Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2896610Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:06.2896887Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2897179Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:06.2897554Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:06.2897950Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:06.2898364Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:06.2898699Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:06.2899021Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:06.2899318Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:06.2899595Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:06.2899906Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:06.2900186Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:06.2900421Z #define __SEG_FS 1 2025-05-07T20:25:06.2900652Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:06.2900937Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:06.2901205Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2901492Z #define __SEG_GS 1 2025-05-07T20:25:06.2901801Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:06.2902186Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:06.2902453Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:06.2902743Z #define __INT16_TYPE__ short int 2025-05-07T20:25:06.2903022Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:06.2903309Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:06.2903574Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:06.2903821Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:06.2904074Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:06.2904414Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.2904798Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2905080Z #define linux 1 2025-05-07T20:25:06.2905307Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2905585Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.2905850Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:06.2906103Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:06.2906362Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:06.2906625Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:06.2906972Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:06.2907382Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:06.2907717Z #define __code_model_small__ 1 2025-05-07T20:25:06.2907991Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:06.2908281Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:06.2908527Z #define __k8__ 1 2025-05-07T20:25:06.2908748Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:06.2909036Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:06.2909340Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:06.2909575Z #define __pic__ 2 2025-05-07T20:25:06.2909927Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2910242Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:06.2910531Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2910865Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:06.2911235Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.2911667Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:06.2911933Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:06.2912227Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:06.2912540Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:06.2912785Z #define __linux__ 1 2025-05-07T20:25:06.2913016Z #define __INT64_TYPE__ long int 2025-05-07T20:25:06.2913303Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:06.2913584Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:06.2913858Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:06.2914119Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:06.2914409Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2914752Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:06.2915049Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:06.2915314Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:06.2915627Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:06.2915930Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:06.2916256Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:06.2916614Z #define __SSE__ 1 2025-05-07T20:25:06.2916850Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:06.2917181Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.2917521Z #define __amd64__ 1 2025-05-07T20:25:06.2917745Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:06.2917992Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:06.2918268Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:06.2918540Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:06.2918807Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:06.2919082Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:06.2919343Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:06.2919618Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:06.2919878Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:06.2920229Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:06.2920699Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:06.2921042Z #define _LP64 1 2025-05-07T20:25:06.2921255Z #define __UINT8_C(c) c 2025-05-07T20:25:06.2921494Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:06.2921752Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:06.2922021Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:06.2922293Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:06.2922587Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:06.2922941Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:06.2923432Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:06.2923833Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2924124Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:06.2924437Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:06.2924803Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:06.2925170Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:06.2925434Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:06.2925773Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:06.2926131Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:06.2926395Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:06.2926644Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:06.2926892Z #define __FXSR__ 1 2025-05-07T20:25:06.2927191Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.2927639Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:06.2928048Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:06.2928449Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:06.2928707Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:06.2929042Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:06.2929397Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:06.2929717Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:06.2929954Z #define __PIC__ 2 2025-05-07T20:25:06.2930203Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:06.2930605Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:06.2930990Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:06.2931326Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:06.2931646Z #define __SSE2__ 1 2025-05-07T20:25:06.2931871Z #define __INT32_TYPE__ int 2025-05-07T20:25:06.2932128Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:06.2932384Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:06.2932721Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:06.2933082Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:06.2933348Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:06.2933619Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:06.2933889Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2934164Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:06.2934416Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:06.2934667Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:06.2934962Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2935255Z #define __PIE__ 2 2025-05-07T20:25:06.2935579Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:06.2935970Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:06.2936306Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:06.2936668Z #define __INT16_C(c) c 2025-05-07T20:25:06.2936892Z #define __STDC__ 1 2025-05-07T20:25:06.2937117Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:06.2937391Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:06.2937658Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:06.2937953Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:06.2938407Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:06.2938739Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:06.2939010Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:06.2939285Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:06.2939550Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:06.2939835Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:06.2940123Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:06.2940394Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:06.2940692Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:06.2941082Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:06.2941456Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:06.2941762Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:06.2942054Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:06.2942310Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:06.2942473Z 2025-05-07T20:25:06.3462444Z 2025-05-07T20:25:06.3463266Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
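The C++ dump that follows differs from the C dump above chiefly in the language-mode macros: `__cplusplus 201703L`, the `__cpp_*` feature-test macros, `__GNUG__`, `__EXCEPTIONS`, and `__GXX_RTTI`. One quick way to see exactly what the C++ front end adds (a generic sketch, not part of this job):

  # Diff the predefined macros of the C and C++ front ends.
  cc  -dM -E -x c   - </dev/null | sort > /tmp/c_defines.txt
  c++ -dM -E -x c++ - </dev/null | sort > /tmp/cxx_defines.txt
  # Lines only in the C++ dump include __cplusplus and the __cpp_* macros.
  diff /tmp/c_defines.txt /tmp/cxx_defines.txt | grep '^>'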
2025-05-07T20:25:06.3463791Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:06.3464051Z 2025-05-07T20:25:08.2259820Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2260185Z #define __cpp_attributes 200809L 2025-05-07T20:25:08.2260553Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:08.2261045Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:08.2261423Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:08.2261799Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2262264Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:08.2262743Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:08.2263133Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:08.2263559Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:08.2264362Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:08.2264775Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:08.2265152Z #define __CHAR_BIT__ 8 2025-05-07T20:25:08.2265424Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:08.2265747Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:08.2266392Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:08.2266800Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:08.2267085Z #define __cpp_static_assert 201411L 2025-05-07T20:25:08.2267390Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:08.2267703Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2268011Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:08.2268316Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:08.2268654Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:08.2268988Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:08.2269395Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:08.2269821Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:08.2270142Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:08.2270425Z #define __GCC_IEC_559 2 2025-05-07T20:25:08.2270678Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2270961Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:08.2271245Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:08.2271544Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2271845Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:08.2272168Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:08.2272489Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:08.2272830Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2273160Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:08.2273434Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2273722Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:08.2274009Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:08.2274311Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:08.2274590Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:08.2274863Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:08.2275144Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:08.2275488Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:08.2275829Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:08.2276083Z #define __INT8_C(c) c 2025-05-07T20:25:08.2276331Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:08.2276614Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:08.2276938Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2277271Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:08.2277555Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:08.2277853Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:08.2278171Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.2278530Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2278823Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:08.2279110Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.2279386Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2279679Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:08.2279963Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:08.2280368Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:08.2280793Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:08.2281084Z #define __linux 1 2025-05-07T20:25:08.2281325Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:08.2281614Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:08.2281903Z #define __unix 1 2025-05-07T20:25:08.2282133Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:08.2282433Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:08.2282736Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:08.2283009Z #define __WINT_MIN__ 0U 2025-05-07T20:25:08.2283265Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2283564Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:08.2284007Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:08.2284287Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:08.2284551Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:08.2284835Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:08.2285142Z #define __INT64_C(c) c ## L 2025-05-07T20:25:08.2285501Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:08.2285800Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:08.2286085Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:08.2286396Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:08.2286681Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:08.2286948Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:08.2287306Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:08.2287706Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:08.2287971Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:08.2288251Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:08.2288535Z #define __DBL_DIG__ 15 2025-05-07T20:25:08.2288784Z #define __FLT32_DIG__ 6 2025-05-07T20:25:08.2289093Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:08.2289442Z #define __GXX_WEAK__ 1 2025-05-07T20:25:08.2289687Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:08.2289940Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:08.2290281Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:08.2290635Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:08.2299125Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:08.2299470Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:08.2299811Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:08.2300224Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:08.2300627Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:08.2300905Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:08.2301167Z #define __unix__ 1 2025-05-07T20:25:08.2301397Z #define __INT_WIDTH__ 32 2025-05-07T20:25:08.2301645Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:08.2301898Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:08.2302157Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:08.2302422Z #define __UINT16_C(c) c 2025-05-07T20:25:08.2302662Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:08.2302926Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:08.2303279Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:08.2303643Z #define __gnu_linux__ 1 2025-05-07T20:25:08.2303885Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:08.2304144Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2304429Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2304717Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2304988Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:08.2305245Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:08.2305498Z #define __GNUC__ 11 2025-05-07T20:25:08.2305718Z #define __GXX_RTTI 1 2025-05-07T20:25:08.2305936Z #define __pie__ 2 2025-05-07T20:25:08.2306157Z #define __MMX__ 1 2025-05-07T20:25:08.2306382Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:08.2306643Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:08.2306932Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:08.2307199Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:08.2307447Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:08.2307748Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:08.2308069Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:08.2308411Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.2308783Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:08.2309094Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2309404Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.2309667Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:08.2309939Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:08.2310249Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:08.2310541Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:08.2311036Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:08.2311305Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:08.2311587Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:08.2311885Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:08.2312157Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:08.2312533Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:08.2312793Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:08.2313060Z #define __cplusplus 201703L 2025-05-07T20:25:08.2313326Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:08.2313620Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:08.2313928Z #define __DEPRECATED 1 2025-05-07T20:25:08.2314184Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:08.2314487Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:08.2314748Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:08.2315067Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.2315424Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:08.2315705Z #define __SSE2_MATH__ 1 2025-05-07T20:25:08.2315960Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:08.2316257Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2316551Z #define __amd64 1 2025-05-07T20:25:08.2316775Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:08.2317046Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:08.2317313Z #define __GNUG__ 11 2025-05-07T20:25:08.2317571Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:08.2317877Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:08.2318133Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:08.2318393Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:08.2318663Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:08.2318921Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:08.2319200Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:08.2319497Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:08.2319756Z #define __cpp_hex_float 201603L 2025-05-07T20:25:08.2320030Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:08.2320308Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:08.2320578Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:08.2320847Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:08.2321116Z #define __x86_64 1 2025-05-07T20:25:08.2321338Z #define __cpp_lambdas 200907L 2025-05-07T20:25:08.2321613Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:08.2321986Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:08.2322368Z #define __cpp_template_auto 201606L 2025-05-07T20:25:08.2322726Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:08.2323176Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:08.2323644Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:08.2324062Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:08.2324323Z #define __LP64__ 1 2025-05-07T20:25:08.2324553Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2324903Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:08.2325282Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:08.2325562Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2325838Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:08.2326120Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:08.2326395Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:08.2326649Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:08.2326915Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:08.2327247Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:08.2327606Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:08.2327881Z #define __FLT_DIG__ 6 2025-05-07T20:25:08.2328118Z #define __NO_INLINE__ 1 2025-05-07T20:25:08.2328361Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:08.2328683Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:08.2329033Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:08.2329297Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:08.2329713Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:08.2329975Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:08.2330257Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:08.2330551Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:08.2330809Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:08.2331233Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:08.2331517Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:08.2331789Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:08.2332095Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:08.2332438Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:08.2332724Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:08.2332988Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:08.2333245Z #define __FLT128_DIG__ 33 2025-05-07T20:25:08.2333481Z #define __INT32_C(c) c 2025-05-07T20:25:08.2333728Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:08.2334016Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:08.2334372Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:08.2334656Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:08.2334968Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:08.2335274Z #define unix 1 2025-05-07T20:25:08.2335496Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:08.2335757Z #define __cpp_rtti 199711L 2025-05-07T20:25:08.2336025Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:08.2336338Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2336641Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:08.2336953Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:08.2337285Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:08.2337533Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:08.2337823Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:08.2338260Z #define __ELF__ 1 2025-05-07T20:25:08.2338537Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:08.2338818Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:08.2339101Z #define __FLT_RADIX__ 2 2025-05-07T20:25:08.2339356Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:08.2339711Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:08.2340076Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:08.2340352Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:08.2340629Z #define __k8 1 2025-05-07T20:25:08.2340929Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:08.2341303Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:08.2341597Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:08.2341901Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:08.2342168Z #define __LDBL_DIG__ 18 2025-05-07T20:25:08.2342407Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:08.2342666Z #define __x86_64__ 1 2025-05-07T20:25:08.2342906Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:08.2343209Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:08.2343551Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2343906Z #define __FLT64_DIG__ 15 2025-05-07T20:25:08.2344195Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2344541Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:08.2344859Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2345130Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:08.2345402Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2345703Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:08.2346070Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:08.2346461Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:08.2346755Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:08.2347079Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:08.2347400Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:08.2347718Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:08.2348014Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:08.2348407Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:08.2348713Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:08.2348997Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:08.2349242Z #define __SEG_FS 1 2025-05-07T20:25:08.2349469Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:08.2349832Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:08.2350110Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2350392Z #define __SEG_GS 1 2025-05-07T20:25:08.2350704Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:08.2351087Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:08.2351365Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:08.2351650Z #define __INT16_TYPE__ short int 2025-05-07T20:25:08.2351946Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:08.2352261Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:08.2352562Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:08.2352810Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:08.2353081Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:08.2353425Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:08.2353817Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2354180Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:08.2354518Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:08.2354823Z #define linux 1 2025-05-07T20:25:08.2355047Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2355326Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:08.2356037Z #define __EXCEPTIONS 1 2025-05-07T20:25:08.2356387Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:08.2356748Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:08.2357097Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:08.2357388Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:08.2357738Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:08.2358123Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:08.2358474Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:08.2358804Z #define __code_model_small__ 1 2025-05-07T20:25:08.2359078Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:08.2359392Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:08.2359693Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:08.2359971Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:08.2360261Z #define __k8__ 1 2025-05-07T20:25:08.2360485Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:08.2360772Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:08.2361071Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:08.2361309Z #define __pic__ 2 2025-05-07T20:25:08.2361561Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2361877Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:08.2362146Z #define __cpp_decltype 200707L 2025-05-07T20:25:08.2362441Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2362773Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:08.2363141Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:08.2363499Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:08.2363838Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:08.2364168Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:08.2364458Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:08.2364713Z #define __linux__ 1 2025-05-07T20:25:08.2364944Z #define __INT64_TYPE__ long int 2025-05-07T20:25:08.2365202Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:08.2365467Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:08.2365742Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:08.2366026Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:08.2366345Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:08.2366644Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2366954Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:08.2367225Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:08.2367777Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:08.2368084Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:08.2368408Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:08.2368765Z #define __SSE__ 1 2025-05-07T20:25:08.2368998Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:08.2369471Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:08.2369810Z #define __amd64__ 1 2025-05-07T20:25:08.2370039Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:08.2370289Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:08.2370565Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:08.2370833Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:08.2371103Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:08.2371363Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:08.2371642Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:08.2371902Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:08.2372248Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:08.2372714Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:08.2373064Z #define _LP64 1 2025-05-07T20:25:08.2373274Z #define __UINT8_C(c) c 2025-05-07T20:25:08.2373520Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:08.2373811Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:08.2374105Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:08.2374369Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:08.2374725Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:08.2375176Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:08.2375551Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2375849Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:08.2376154Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:08.2376465Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:08.2376847Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:08.2377217Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:08.2377477Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:08.2377742Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:08.2378168Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:08.2378531Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:08.2378792Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:08.2379042Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:08.2379287Z #define __FXSR__ 1 2025-05-07T20:25:08.2379590Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:08.2380037Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:08.2380443Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:08.2380746Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:08.2381016Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:08.2381317Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:08.2381606Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:08.2381883Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:08.2382246Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:08.2382606Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:08.2382877Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:08.2383130Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:08.2383367Z #define __PIC__ 2 2025-05-07T20:25:08.2383624Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:08.2384057Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:08.2384467Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:08.2384801Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:08.2385148Z #define __cpp_constexpr 201603L 2025-05-07T20:25:08.2385413Z #define __SSE2__ 1 2025-05-07T20:25:08.2385646Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:08.2385939Z #define __INT32_TYPE__ int 2025-05-07T20:25:08.2386194Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:08.2386555Z #define __cpp_exceptions 199711L 2025-05-07T20:25:08.2386836Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:08.2387171Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:08.2387523Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:08.2387950Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:08.2388223Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:08.2388485Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2388761Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:08.2389012Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:08.2389273Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:08.2389560Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:08.2389853Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2390150Z #define __PIE__ 2 2025-05-07T20:25:08.2390467Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:08.2390880Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:08.2391196Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:08.2391536Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:08.2391900Z #define __INT16_C(c) c 2025-05-07T20:25:08.2392127Z #define __STDC__ 1 2025-05-07T20:25:08.2392343Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:08.2392609Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:08.2392884Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:08.2393142Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:08.2393436Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:08.2393819Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:08.2394162Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:08.2394425Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:08.2394719Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:08.2395002Z #define __SSE_MATH__ 1 2025-05-07T20:25:08.2395239Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:08.2395528Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:08.2395843Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:08.2396125Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:08.2396420Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:08.2396696Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:08.2396993Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:08.2397395Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:08.2397769Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:08.2398079Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:08.2398367Z #define _GNU_SOURCE 1 2025-05-07T20:25:08.2398616Z #define __cpp_init_captures 201304L 2025-05-07T20:25:08.2398899Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:08.2399147Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:08.2399318Z 2025-05-07T20:25:08.2901208Z 2025-05-07T20:25:08.2901944Z + conda run -n build_binary c++ --version 2025-05-07T20:25:08.2902202Z 2025-05-07T20:25:10.1615770Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:10.1616192Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:10.1616645Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:10.1617184Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:10.1617523Z 2025-05-07T20:25:10.1617527Z 2025-05-07T20:25:10.2242804Z 2025-05-07T20:25:10.2243665Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:10.2244239Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:10.2244592Z 2025-05-07T20:25:12.1665937Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:12.1668401Z 2025-05-07T20:25:12.1668869Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:12.1669435Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:12.1669749Z 2025-05-07T20:25:14.1113662Z #define __cplusplus 201703L 2025-05-07T20:25:14.1115851Z 2025-05-07T20:25:14.1116805Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:14.1166296Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:14.1166722Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:14.1179768Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:14.1180120Z env: 2025-05-07T20:25:14.1180356Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:14.1180654Z BUILD_ENV: build_binary 2025-05-07T20:25:14.1180903Z BUILD_TARGET: genai 2025-05-07T20:25:14.1181140Z BUILD_VARIANT: cuda 2025-05-07T20:25:14.1181373Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:14.1181633Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:14.1181940Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:14.1182269Z ##[endgroup] 2025-05-07T20:25:14.4525831Z ################################################################################ 2025-05-07T20:25:14.4526324Z # Install CUDA 2025-05-07T20:25:14.4526613Z # 2025-05-07T20:25:14.4540807Z # [2025-05-07T20:25:14.453Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:14.4541344Z ################################################################################ 2025-05-07T20:25:14.4541641Z 2025-05-07T20:25:14.4556121Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:14.5413473Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:14.5413978Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:14.5417939Z + conda clean --packages --tarball -y 2025-05-07T20:25:14.5418315Z 2025-05-07T20:25:15.2504669Z Will remove 32 (142.2 MB) tarball(s). 2025-05-07T20:25:15.2505340Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:15.3123207Z 2025-05-07T20:25:15.3131447Z + conda clean --all -y 2025-05-07T20:25:15.3131652Z 2025-05-07T20:25:15.9839713Z There are no unused tarball(s) to remove. 2025-05-07T20:25:15.9840550Z Will remove 1 index cache(s). 2025-05-07T20:25:15.9841200Z There are no unused package(s) to remove. 2025-05-07T20:25:15.9841849Z There are no tempfile(s) to remove. 2025-05-07T20:25:15.9842456Z There are no logfile(s) to remove. 2025-05-07T20:25:16.0462557Z 2025-05-07T20:25:16.0476599Z [INSTALL] Installing CUDA 12.6.3 ... 
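[note] The standard-version probes above read compiler-predefined macros rather than parsing --version output. A minimal standalone sketch of the same checks, assuming a GCC-compatible cc/c++ on PATH (the job runs them through conda run -n build_binary):

    cc  -dM -E -        < /dev/null | grep __STDC_VERSION__   # 201710L -> C17 is the default
    c++ -dM -E -x c++ - < /dev/null | grep __cplusplus        # 201703L -> C++17 is the default

The same -dM -E invocation without the grep produces the full predefined-macro dump printed earlier in this log.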
2025-05-07T20:25:16.0500513Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:16.9585733Z Channels: 2025-05-07T20:25:16.9586119Z - conda-forge 2025-05-07T20:25:16.9586443Z Platform: linux-64 2025-05-07T20:25:27.4431921Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:25:28.5172551Z Solving environment: / - \ | done 2025-05-07T20:25:28.5903816Z 2025-05-07T20:25:28.5904139Z ## Package Plan ## 2025-05-07T20:25:28.5904385Z 2025-05-07T20:25:28.5904680Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:28.5905013Z 2025-05-07T20:25:28.5905117Z added / updated specs: 2025-05-07T20:25:28.5905359Z - cuda=12.6.3 2025-05-07T20:25:28.5905500Z 2025-05-07T20:25:28.5905530Z 2025-05-07T20:25:28.5905655Z The following packages will be downloaded: 2025-05-07T20:25:28.5905875Z 2025-05-07T20:25:28.5905998Z package | build 2025-05-07T20:25:28.5906472Z ---------------------------|----------------- 2025-05-07T20:25:28.5906938Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:28.5907519Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:28.5908154Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:28.5908739Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:28.5909259Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:28.5909694Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:28.5910200Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5911679Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:28.5912180Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:28.5912808Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:28.5913258Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5913737Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5914234Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:28.5914740Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5915254Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:28.5915775Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:28.5916262Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:28.5916776Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:28.5917228Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:28.5917693Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:28.5918159Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.5918645Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:28.5919116Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:28.5919559Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5920038Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.5920508Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:28.5920951Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:28.5921414Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:28.5921895Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:25:28.5922357Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB 
conda-forge 2025-05-07T20:25:28.5922825Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:28.5923287Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:28.5923736Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:28.5924183Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:28.5924631Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:28.5925077Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:28.5925516Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:28.5925964Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:28.5926422Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:28.5926891Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:28.5927351Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:28.5927797Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:28.5928237Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:28.5928693Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:28.5929219Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:28.5929816Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.5930286Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:28.5930840Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.5931277Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.5931709Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:28.5932163Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.5932628Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:28.5933061Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:28.5933444Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:28.5933915Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:28.5934433Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:28.5934951Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:28.5935448Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:28.5935901Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:28.5936366Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.5936834Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.5937276Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:28.5937677Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.5938218Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:28.5938620Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:28.5939000Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.5939402Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:28.5939797Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:28.5940187Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:28.5940605Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:28.5941054Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:28.5941493Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:28.5941932Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:28.5942378Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 
2025-05-07T20:25:28.5942823Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:28.5943268Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:28.5943722Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:28.5944178Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:28.5944637Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:28.5945101Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:28.5945566Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:28.5946033Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:28.5946465Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:28.5947002Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:28.5947453Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:28.5947974Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:28.5948411Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:28.5948858Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:28.5949321Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:28.5949720Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:28.5950126Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:28.5950554Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:28.5950976Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:28.5951382Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:28.5951811Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:28.5952281Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:28.5952742Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:28.5953212Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:28.5953675Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:28.5954131Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:28.5954559Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:28.5954977Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:28.5955417Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:28.5956301Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:28.5956732Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:28.5957138Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:28.5957567Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:28.5958007Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:28.5958429Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:28.5958842Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:28.5959268Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:28.5959731Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:28.5960174Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:28.5960556Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:28.5960949Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:28.5961398Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:28.5961839Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:28.5962262Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:28.5962717Z python-3.10.13 |hd12c33a_1_cpython 24.5 MB 
conda-forge 2025-05-07T20:25:28.5963150Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:28.5963559Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:28.5964118Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:28.5964529Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:28.5964938Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:28.5965516Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:28.5965971Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:28.5966430Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:28.5966914Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:28.5967369Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:28.5967823Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:28.5968281Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:28.5968717Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:28.5969142Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:28.5969589Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:28.5970062Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:28.5970539Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:28.5971000Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.5971450Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:28.5971903Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.5972346Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:28.5972799Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:28.5973268Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:28.5973722Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:28.5974145Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:28.5974529Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:28.5974906Z ------------------------------------------------------------ 2025-05-07T20:25:28.5975241Z Total: 1.63 GB 2025-05-07T20:25:28.5975456Z 2025-05-07T20:25:28.5975590Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:28.5975814Z 2025-05-07T20:25:28.5976020Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:28.5976445Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:28.5976868Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:28.5977334Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:28.5977773Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:28.5978321Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:28.5978950Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.5979560Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:28.5980109Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.5980667Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:28.5981193Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:25:28.5981720Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.5982405Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5983009Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:28.5983890Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5984503Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5985067Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5985582Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.5986092Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:28.5986627Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5987170Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.5987748Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.5988278Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:28.5988779Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:28.5989400Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:28.5989943Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:28.5990425Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:28.5990954Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:28.5991519Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:28.5992063Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.5992628Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:28.5993181Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5993707Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5994217Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.5994726Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.5995350Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.5995904Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.5996401Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.5996925Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.5997490Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:28.5998041Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.5998552Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:28.5999036Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.5999565Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.6000133Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:28.6000680Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:28.6001233Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.6001789Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:28.6002268Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.6002747Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:28.6003399Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.6003952Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:28.6004482Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:28.6004893Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:28.6005415Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:28.6006028Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:28.6006626Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:28.6007205Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:28.6007713Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:28.6008214Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:28.6008713Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:28.6009186Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.6009622Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:28.6010044Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.6010472Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:28.6010854Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:28.6011269Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:28.6011685Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:28.6012090Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:28.6012536Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.6013049Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.6013545Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:28.6014030Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:28.6014528Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:28.6015033Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.6015532Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:28.6016043Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:28.6016567Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.6017097Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.6017634Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:28.6018258Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:28.6018778Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:28.6019242Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:28.6019714Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.6020219Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:28.6020734Z libgcrypt-lib 
conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:28.6021211Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:28.6021675Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:28.6022150Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:28.6022584Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:28.6023123Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.6023592Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.6024049Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:28.6024555Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:28.6025021Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.6025555Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.6026095Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.6026637Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.6027168Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:28.6027679Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:28.6028171Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:28.6028613Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:28.6029098Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:28.6029565Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:28.6030002Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:28.6039256Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:28.6039789Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:28.6040254Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:28.6040694Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.6041115Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:28.6041609Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:28.6042098Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:28.6042481Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:28.6042887Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:28.6043377Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:28.6043870Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:28.6044341Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:28.6044833Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:28.6045269Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:28.6045704Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:28.6046200Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:28.6046726Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:28.6047265Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:28.6047851Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:28.6048390Z xcb-util-wm 
conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:28.6048901Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:28.6049429Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:28.6049908Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:28.6050386Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:28.6050862Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:28.6051564Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:28.6052153Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:28.6052689Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:28.6053269Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:28.6053962Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:28.6054526Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:28.6055079Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:28.6055892Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:28.6056529Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:28.6057169Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:28.6057431Z 2025-05-07T20:25:28.6057564Z The following packages will be UPDATED: 2025-05-07T20:25:28.6057772Z 2025-05-07T20:25:28.6058126Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:28.6058739Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.6059114Z 2025-05-07T20:25:28.6059344Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:28.6059659Z 2025-05-07T20:25:28.6059956Z python pkgs/main::python-3.10.16-he870216_1 --> conda-forge::python-3.10.13-hd12c33a_1_cpython 2025-05-07T20:25:28.6060586Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:28.6061160Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:28.6061480Z 2025-05-07T20:25:28.6061504Z 2025-05-07T20:25:28.6061508Z 2025-05-07T20:25:28.6061654Z Downloading and Extracting Packages: ...working... 2025-05-07T20:25:28.6062037Z nsight-compute-2024. 
| 443.1 MB | | 0%
... (interleaved conda download progress bars elided: nsight-compute, libcublas, libcufft, libcusparse, cuda-nsight, cuda-nvvp, libcusolver, libnpp, and the remaining packages download in parallel) ...
| 443.1 MB | ## | 21% 2025-05-07T20:25:31.3507560Z 2025-05-07T20:25:31.3507565Z 2025-05-07T20:25:31.3584094Z libcufft-11.3.0.4 | 156.2 MB | ######3 | 64%  2025-05-07T20:25:31.3584455Z 2025-05-07T20:25:31.3584460Z 2025-05-07T20:25:31.3584466Z 2025-05-07T20:25:31.3584748Z 2025-05-07T20:25:31.3639079Z cuda-nsight-12.6.77 | 113.2 MB | ######9 | 69%  2025-05-07T20:25:31.3639467Z 2025-05-07T20:25:31.3639473Z 2025-05-07T20:25:31.3641116Z 2025-05-07T20:25:31.4146358Z libcusparse-12.5.4.2 | 118.6 MB | #######1 | 72%  2025-05-07T20:25:31.4147358Z 2025-05-07T20:25:31.4253229Z libcublas-12.6.4.1 | 256.2 MB | ###9 | 39%  2025-05-07T20:25:31.4584728Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.4585087Z 2025-05-07T20:25:31.4585093Z 2025-05-07T20:25:31.4585098Z 2025-05-07T20:25:31.4587098Z 2025-05-07T20:25:31.4625554Z cuda-nsight-12.6.77 | 113.2 MB | #######2 | 72%  2025-05-07T20:25:31.4625956Z 2025-05-07T20:25:31.4626444Z 2025-05-07T20:25:31.4640907Z libcufft-11.3.0.4 | 156.2 MB | ######5 | 66%  2025-05-07T20:25:31.4641277Z 2025-05-07T20:25:31.4641282Z 2025-05-07T20:25:31.4643705Z 2025-05-07T20:25:31.5192930Z libcusparse-12.5.4.2 | 118.6 MB | #######5 | 75%  2025-05-07T20:25:31.5193942Z 2025-05-07T20:25:31.5298611Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.5626306Z nsight-compute-2024. | 443.1 MB | ##2 | 22% 2025-05-07T20:25:31.5626695Z 2025-05-07T20:25:31.5626702Z 2025-05-07T20:25:31.5684606Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 68%  2025-05-07T20:25:31.5684969Z 2025-05-07T20:25:31.5684975Z 2025-05-07T20:25:31.5684980Z 2025-05-07T20:25:31.5687687Z 2025-05-07T20:25:31.5857968Z cuda-nsight-12.6.77 | 113.2 MB | #######5 | 75%  2025-05-07T20:25:31.5858426Z 2025-05-07T20:25:31.5858430Z 2025-05-07T20:25:31.5858434Z 2025-05-07T20:25:31.6224289Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 78%  2025-05-07T20:25:31.6227751Z 2025-05-07T20:25:31.6298660Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 42%  2025-05-07T20:25:31.6626688Z nsight-compute-2024. | 443.1 MB | ##3 | 23% 2025-05-07T20:25:31.6627059Z 2025-05-07T20:25:31.6629160Z 2025-05-07T20:25:31.6745860Z libcufft-11.3.0.4 | 156.2 MB | ####### | 70%  2025-05-07T20:25:31.6746174Z 2025-05-07T20:25:31.6746177Z 2025-05-07T20:25:31.6746194Z 2025-05-07T20:25:31.6749117Z 2025-05-07T20:25:31.6861728Z cuda-nsight-12.6.77 | 113.2 MB | #######7 | 78%  2025-05-07T20:25:31.6862382Z 2025-05-07T20:25:31.6862386Z 2025-05-07T20:25:31.6863011Z 2025-05-07T20:25:31.7227914Z libcusparse-12.5.4.2 | 118.6 MB | ######## | 81%  2025-05-07T20:25:31.7229984Z 2025-05-07T20:25:31.7427155Z libcublas-12.6.4.1 | 256.2 MB | ####3 | 43%  2025-05-07T20:25:31.7630145Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:31.7630532Z 2025-05-07T20:25:31.7630539Z 2025-05-07T20:25:31.7748914Z libcufft-11.3.0.4 | 156.2 MB | #######2 | 73%  2025-05-07T20:25:31.7749228Z 2025-05-07T20:25:31.7749232Z 2025-05-07T20:25:31.7749236Z 2025-05-07T20:25:31.7749950Z 2025-05-07T20:25:31.7894803Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 81%  2025-05-07T20:25:31.7895110Z 2025-05-07T20:25:31.7895114Z 2025-05-07T20:25:31.7899025Z 2025-05-07T20:25:31.8232440Z libcusparse-12.5.4.2 | 118.6 MB | ########3 | 83%  2025-05-07T20:25:31.8233339Z 2025-05-07T20:25:31.8428457Z libcublas-12.6.4.1 | 256.2 MB | ####4 | 45%  2025-05-07T20:25:31.8632435Z nsight-compute-2024. 
| 443.1 MB | ##4 | 25% 2025-05-07T20:25:31.8632703Z 2025-05-07T20:25:31.8632707Z 2025-05-07T20:25:31.8752740Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 75%  2025-05-07T20:25:31.8753004Z 2025-05-07T20:25:31.8753008Z 2025-05-07T20:25:31.8753012Z 2025-05-07T20:25:31.8755354Z 2025-05-07T20:25:31.9232663Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 84%  2025-05-07T20:25:31.9233013Z 2025-05-07T20:25:31.9433302Z libcublas-12.6.4.1 | 256.2 MB | ####6 | 46%  2025-05-07T20:25:31.9633561Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.9633925Z 2025-05-07T20:25:31.9636305Z 2025-05-07T20:25:31.9755449Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 78%  2025-05-07T20:25:31.9755963Z 2025-05-07T20:25:31.9755967Z 2025-05-07T20:25:31.9755970Z 2025-05-07T20:25:31.9756193Z 2025-05-07T20:25:31.9937784Z cuda-nsight-12.6.77 | 113.2 MB | ########8 | 88%  2025-05-07T20:25:31.9938167Z 2025-05-07T20:25:31.9938171Z 2025-05-07T20:25:31.9939596Z 2025-05-07T20:25:32.0233560Z libcusparse-12.5.4.2 | 118.6 MB | ########6 | 86%  2025-05-07T20:25:32.0235634Z 2025-05-07T20:25:32.0437142Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 48%  2025-05-07T20:25:32.0778463Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:32.0778768Z 2025-05-07T20:25:32.0778772Z 2025-05-07T20:25:32.0939323Z libcufft-11.3.0.4 | 156.2 MB | ######## | 81%  2025-05-07T20:25:32.0939624Z 2025-05-07T20:25:32.0939630Z 2025-05-07T20:25:32.0940325Z 2025-05-07T20:25:32.1053191Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 88%  2025-05-07T20:25:32.1053501Z 2025-05-07T20:25:32.1053507Z 2025-05-07T20:25:32.1053513Z 2025-05-07T20:25:32.1054167Z 2025-05-07T20:25:32.1255901Z cuda-nsight-12.6.77 | 113.2 MB | #########1 | 91%  2025-05-07T20:25:32.1259269Z 2025-05-07T20:25:32.1443833Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:25:32.1867162Z nsight-compute-2024. | 443.1 MB | ##7 | 27% 2025-05-07T20:25:32.1867433Z 2025-05-07T20:25:32.1870549Z 2025-05-07T20:25:32.1995334Z libcufft-11.3.0.4 | 156.2 MB | ########3 | 83%  2025-05-07T20:25:32.1995604Z 2025-05-07T20:25:32.1995608Z 2025-05-07T20:25:32.1995611Z 2025-05-07T20:25:32.2055250Z libcusparse-12.5.4.2 | 118.6 MB | ######### | 91%  2025-05-07T20:25:32.2055844Z 2025-05-07T20:25:32.2055848Z 2025-05-07T20:25:32.2055852Z 2025-05-07T20:25:32.2055855Z 2025-05-07T20:25:32.2357350Z cuda-nsight-12.6.77 | 113.2 MB | #########4 | 95%  2025-05-07T20:25:32.2357710Z 2025-05-07T20:25:32.2530609Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:25:32.2998126Z nsight-compute-2024. | 443.1 MB | ##8 | 28% 2025-05-07T20:25:32.2998532Z 2025-05-07T20:25:32.2998538Z 2025-05-07T20:25:32.3002221Z 2025-05-07T20:25:32.3050019Z libcusparse-12.5.4.2 | 118.6 MB | #########3 | 94%  2025-05-07T20:25:32.3050306Z 2025-05-07T20:25:32.3050972Z 2025-05-07T20:25:32.3058335Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 86%  2025-05-07T20:25:32.3058609Z 2025-05-07T20:25:32.3058613Z 2025-05-07T20:25:32.3058617Z 2025-05-07T20:25:32.3058621Z 2025-05-07T20:25:32.3384868Z cuda-nsight-12.6.77 | 113.2 MB | #########7 | 98%  2025-05-07T20:25:32.3385170Z 2025-05-07T20:25:32.3532082Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 52%  2025-05-07T20:25:32.3998560Z nsight-compute-2024. 
| 443.1 MB | ##9 | 29% 2025-05-07T20:25:32.3998851Z 2025-05-07T20:25:32.3998855Z 2025-05-07T20:25:32.3999407Z 2025-05-07T20:25:32.4118597Z libcusparse-12.5.4.2 | 118.6 MB | #########6 | 96%  2025-05-07T20:25:32.4118951Z 2025-05-07T20:25:32.4118957Z 2025-05-07T20:25:32.4463994Z libcufft-11.3.0.4 | 156.2 MB | ########8 | 88%  2025-05-07T20:25:32.4464278Z 2025-05-07T20:25:32.4535193Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 54%  2025-05-07T20:25:32.5039385Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:25:32.5039794Z 2025-05-07T20:25:32.5039799Z 2025-05-07T20:25:32.5041446Z 2025-05-07T20:25:32.5119093Z libcusparse-12.5.4.2 | 118.6 MB | #########8 | 99%  2025-05-07T20:25:32.5119388Z 2025-05-07T20:25:32.5123533Z 2025-05-07T20:25:32.5537920Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:25:32.5538811Z 2025-05-07T20:25:32.5545239Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:32.6120756Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:25:32.6121023Z 2025-05-07T20:25:32.6121026Z 2025-05-07T20:25:32.6538474Z libcufft-11.3.0.4 | 156.2 MB | #########2 | 93%  2025-05-07T20:25:32.6539458Z 2025-05-07T20:25:32.6544234Z libcublas-12.6.4.1 | 256.2 MB | #####6 | 57%  2025-05-07T20:25:32.7121787Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:25:32.7122246Z 2025-05-07T20:25:32.7122578Z 2025-05-07T20:25:32.7544109Z libcufft-11.3.0.4 | 156.2 MB | #########5 | 95%  2025-05-07T20:25:32.7547116Z 2025-05-07T20:25:32.7550883Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 58%  2025-05-07T20:25:32.8125109Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:32.8125378Z 2025-05-07T20:25:32.8125649Z 2025-05-07T20:25:32.8547598Z libcufft-11.3.0.4 | 156.2 MB | #########7 | 97%  2025-05-07T20:25:32.8547870Z 2025-05-07T20:25:32.8552035Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 60%  2025-05-07T20:25:32.9184425Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:32.9184681Z 2025-05-07T20:25:32.9184684Z 2025-05-07T20:25:32.9549873Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 100%  2025-05-07T20:25:32.9551904Z 2025-05-07T20:25:32.9556427Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 61%  2025-05-07T20:25:33.0550880Z nsight-compute-2024. | 443.1 MB | ###4 | 35% 2025-05-07T20:25:33.0551817Z 2025-05-07T20:25:33.0558592Z libcublas-12.6.4.1 | 256.2 MB | ######2 | 63%  2025-05-07T20:25:33.1551326Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:25:33.1552291Z 2025-05-07T20:25:33.1593646Z libcublas-12.6.4.1 | 256.2 MB | ######4 | 65%  2025-05-07T20:25:33.2551136Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:25:33.2552735Z 2025-05-07T20:25:33.2870569Z libcublas-12.6.4.1 | 256.2 MB | ######6 | 67%  2025-05-07T20:25:33.3554746Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.3556086Z 2025-05-07T20:25:33.3871332Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:33.4560908Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.4562172Z 2025-05-07T20:25:33.4872839Z libcublas-12.6.4.1 | 256.2 MB | ####### | 70%  2025-05-07T20:25:33.5629862Z nsight-compute-2024. | 443.1 MB | #### | 40% 2025-05-07T20:25:33.5630258Z 2025-05-07T20:25:33.5875550Z libcublas-12.6.4.1 | 256.2 MB | #######2 | 72%  2025-05-07T20:25:33.6630558Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:33.6630927Z 2025-05-07T20:25:33.6875085Z libcublas-12.6.4.1 | 256.2 MB | #######4 | 74%  2025-05-07T20:25:33.7683441Z nsight-compute-2024. 
| 443.1 MB | ####2 | 43% 2025-05-07T20:25:33.7686338Z 2025-05-07T20:25:33.7875624Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:25:33.8722861Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:33.8723310Z 2025-05-07T20:25:33.8879307Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:25:33.9724839Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:25:33.9725196Z 2025-05-07T20:25:33.9894328Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:34.0777302Z nsight-compute-2024. | 443.1 MB | ####6 | 46% 2025-05-07T20:25:34.0777596Z 2025-05-07T20:25:34.0903759Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:25:34.1778154Z nsight-compute-2024. | 443.1 MB | ####7 | 47% 2025-05-07T20:25:34.1778492Z 2025-05-07T20:25:34.1904225Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:34.2780022Z nsight-compute-2024. | 443.1 MB | ####8 | 49% 2025-05-07T20:25:34.2780370Z 2025-05-07T20:25:34.2942290Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:25:34.3782427Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:34.3782693Z 2025-05-07T20:25:34.3996610Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 87%  2025-05-07T20:25:34.4784613Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:25:34.4785451Z 2025-05-07T20:25:34.4998835Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:25:34.5660251Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:34.5660654Z 2025-05-07T20:25:34.5660929Z 2025-05-07T20:25:34.5660935Z 2025-05-07T20:25:34.5665874Z 2025-05-07T20:25:34.5800239Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:34.5800564Z 2025-05-07T20:25:34.6089607Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:25:34.6284334Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:34.6284610Z 2025-05-07T20:25:34.6284614Z 2025-05-07T20:25:34.6284618Z 2025-05-07T20:25:34.6284622Z 2025-05-07T20:25:34.6286007Z 2025-05-07T20:25:34.6949733Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.6952170Z 2025-05-07T20:25:34.7284774Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:34.7285054Z 2025-05-07T20:25:34.7285058Z 2025-05-07T20:25:34.7285061Z 2025-05-07T20:25:34.7285082Z 2025-05-07T20:25:34.7285278Z 2025-05-07T20:25:34.7306909Z cuda-nvvp-12.6.80 | 109.3 MB | 2 | 2%  2025-05-07T20:25:34.8228983Z nsight-compute-2024. | 443.1 MB | #####4 | 54% 2025-05-07T20:25:34.8234352Z 2025-05-07T20:25:34.8287943Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:25:34.8288209Z 2025-05-07T20:25:34.8288213Z 2025-05-07T20:25:34.8288217Z 2025-05-07T20:25:34.8288221Z 2025-05-07T20:25:34.8288888Z 2025-05-07T20:25:34.8614425Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.9288611Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:34.9288872Z 2025-05-07T20:25:34.9288882Z 2025-05-07T20:25:34.9288886Z 2025-05-07T20:25:34.9288889Z 2025-05-07T20:25:34.9291010Z 2025-05-07T20:25:34.9434057Z cuda-nvvp-12.6.80 | 109.3 MB | 8 | 9%  2025-05-07T20:25:34.9434372Z 2025-05-07T20:25:34.9892279Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:25:35.0293290Z nsight-compute-2024. 
| 443.1 MB | #####6 | 56% 2025-05-07T20:25:35.0293666Z 2025-05-07T20:25:35.0293672Z 2025-05-07T20:25:35.0293689Z 2025-05-07T20:25:35.0293694Z 2025-05-07T20:25:35.0296567Z 2025-05-07T20:25:35.0497557Z cuda-nvvp-12.6.80 | 109.3 MB | #1 | 12%  2025-05-07T20:25:35.0498693Z 2025-05-07T20:25:35.0927530Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1294376Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:25:35.1294726Z 2025-05-07T20:25:35.1294730Z 2025-05-07T20:25:35.1294734Z 2025-05-07T20:25:35.1294738Z 2025-05-07T20:25:35.1297136Z 2025-05-07T20:25:35.1531500Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:25:35.1534038Z 2025-05-07T20:25:35.1957803Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 98%  2025-05-07T20:25:35.2294653Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:35.2295021Z 2025-05-07T20:25:35.2295025Z 2025-05-07T20:25:35.2295029Z 2025-05-07T20:25:35.2295033Z 2025-05-07T20:25:35.2297890Z 2025-05-07T20:25:35.2532553Z cuda-nvvp-12.6.80 | 109.3 MB | #8 | 19%  2025-05-07T20:25:35.2533913Z 2025-05-07T20:25:35.2984006Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:35.3295290Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:25:35.3295645Z 2025-05-07T20:25:35.3295649Z 2025-05-07T20:25:35.3295653Z 2025-05-07T20:25:35.3295656Z 2025-05-07T20:25:35.3296471Z 2025-05-07T20:25:35.3492615Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:25:35.3492950Z 2025-05-07T20:25:35.3492954Z 2025-05-07T20:25:35.3494339Z 2025-05-07T20:25:35.3991784Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:35.3992117Z 2025-05-07T20:25:35.3992120Z 2025-05-07T20:25:35.3992124Z 2025-05-07T20:25:35.3992128Z 2025-05-07T20:25:35.3992132Z 2025-05-07T20:25:35.3993038Z 2025-05-07T20:25:35.4142861Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:35.4406455Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:35.4406996Z 2025-05-07T20:25:35.4407000Z 2025-05-07T20:25:35.4407003Z 2025-05-07T20:25:35.4407007Z 2025-05-07T20:25:35.4407019Z 2025-05-07T20:25:35.4991760Z cuda-nvvp-12.6.80 | 109.3 MB | ##6 | 26%  2025-05-07T20:25:35.4992061Z 2025-05-07T20:25:35.4992065Z 2025-05-07T20:25:35.4992068Z 2025-05-07T20:25:35.4992072Z 2025-05-07T20:25:35.4992085Z 2025-05-07T20:25:35.4992192Z 2025-05-07T20:25:35.5276896Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:25:35.5541514Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:25:35.5541824Z 2025-05-07T20:25:35.5541828Z 2025-05-07T20:25:35.5541832Z 2025-05-07T20:25:35.5541836Z 2025-05-07T20:25:35.5541839Z 2025-05-07T20:25:35.5995420Z cuda-nvvp-12.6.80 | 109.3 MB | ##9 | 29%  2025-05-07T20:25:35.5995718Z 2025-05-07T20:25:35.5995722Z 2025-05-07T20:25:35.5995725Z 2025-05-07T20:25:35.5995729Z 2025-05-07T20:25:35.5995732Z 2025-05-07T20:25:35.5995744Z 2025-05-07T20:25:35.6453828Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 7%  2025-05-07T20:25:35.6568862Z nsight-compute-2024. 
| 443.1 MB | ######1 | 62% 2025-05-07T20:25:35.6569115Z 2025-05-07T20:25:35.6569119Z 2025-05-07T20:25:35.6569122Z 2025-05-07T20:25:35.6569126Z 2025-05-07T20:25:35.6570541Z 2025-05-07T20:25:35.6995681Z cuda-nvvp-12.6.80 | 109.3 MB | ###2 | 33%  2025-05-07T20:25:35.6995965Z 2025-05-07T20:25:35.6995969Z 2025-05-07T20:25:35.6995973Z 2025-05-07T20:25:35.6995976Z 2025-05-07T20:25:35.6995980Z 2025-05-07T20:25:35.6995983Z 2025-05-07T20:25:35.7614115Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 10%  2025-05-07T20:25:35.7614414Z 2025-05-07T20:25:35.7614417Z 2025-05-07T20:25:35.7614421Z 2025-05-07T20:25:35.7614436Z 2025-05-07T20:25:35.7615906Z 2025-05-07T20:25:35.7622307Z cuda-nvvp-12.6.80 | 109.3 MB | ###5 | 36%  2025-05-07T20:25:35.7997672Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.8006003Z 2025-05-07T20:25:35.8006471Z 2025-05-07T20:25:35.8006478Z 2025-05-07T20:25:35.8006484Z 2025-05-07T20:25:35.8006490Z 2025-05-07T20:25:35.8006495Z 2025-05-07T20:25:35.8667371Z libcusolver-11.7.1.2 | 95.8 MB | #3 | 13%  2025-05-07T20:25:35.8667792Z 2025-05-07T20:25:35.8667797Z 2025-05-07T20:25:35.8667802Z 2025-05-07T20:25:35.8667807Z 2025-05-07T20:25:35.8669302Z 2025-05-07T20:25:35.8734176Z cuda-nvvp-12.6.80 | 109.3 MB | ###8 | 39%  2025-05-07T20:25:35.9024050Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:35.9024420Z 2025-05-07T20:25:35.9024425Z 2025-05-07T20:25:35.9024431Z 2025-05-07T20:25:35.9024436Z 2025-05-07T20:25:35.9024441Z 2025-05-07T20:25:35.9024447Z 2025-05-07T20:25:35.9672819Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 16%  2025-05-07T20:25:35.9673262Z 2025-05-07T20:25:35.9673269Z 2025-05-07T20:25:35.9673274Z 2025-05-07T20:25:35.9673291Z 2025-05-07T20:25:35.9673297Z 2025-05-07T20:25:35.9834053Z cuda-nvvp-12.6.80 | 109.3 MB | ####1 | 42%  2025-05-07T20:25:36.0030922Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:25:36.0031289Z 2025-05-07T20:25:36.0031445Z 2025-05-07T20:25:36.0031452Z 2025-05-07T20:25:36.0031474Z 2025-05-07T20:25:36.0031508Z 2025-05-07T20:25:36.0034371Z 2025-05-07T20:25:36.0768321Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:36.0768736Z 2025-05-07T20:25:36.0768742Z 2025-05-07T20:25:36.0768746Z 2025-05-07T20:25:36.0768752Z 2025-05-07T20:25:36.0772620Z 2025-05-07T20:25:36.0870995Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 45%  2025-05-07T20:25:36.1037258Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:36.1037635Z 2025-05-07T20:25:36.1037640Z 2025-05-07T20:25:36.1037646Z 2025-05-07T20:25:36.1037651Z 2025-05-07T20:25:36.1037656Z 2025-05-07T20:25:36.1037834Z 2025-05-07T20:25:36.1822256Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 23%  2025-05-07T20:25:36.1822609Z 2025-05-07T20:25:36.1822613Z 2025-05-07T20:25:36.1822617Z 2025-05-07T20:25:36.1822620Z 2025-05-07T20:25:36.1824106Z 2025-05-07T20:25:36.1897377Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 48%  2025-05-07T20:25:36.2050393Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:25:36.2050791Z 2025-05-07T20:25:36.2050798Z 2025-05-07T20:25:36.2050805Z 2025-05-07T20:25:36.2050812Z 2025-05-07T20:25:36.2050818Z 2025-05-07T20:25:36.2050824Z 2025-05-07T20:25:36.2825953Z libcusolver-11.7.1.2 | 95.8 MB | ##5 | 26%  2025-05-07T20:25:36.2826378Z 2025-05-07T20:25:36.2826384Z 2025-05-07T20:25:36.2826389Z 2025-05-07T20:25:36.2826409Z 2025-05-07T20:25:36.2828608Z 2025-05-07T20:25:36.2900917Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 51%  2025-05-07T20:25:36.3054700Z nsight-compute-2024. 
| 443.1 MB | ######5 | 66% 2025-05-07T20:25:36.3055081Z 2025-05-07T20:25:36.3055087Z 2025-05-07T20:25:36.3055092Z 2025-05-07T20:25:36.3055098Z 2025-05-07T20:25:36.3055103Z 2025-05-07T20:25:36.3055119Z 2025-05-07T20:25:36.3846140Z libcusolver-11.7.1.2 | 95.8 MB | ##9 | 29%  2025-05-07T20:25:36.3846496Z 2025-05-07T20:25:36.3846500Z 2025-05-07T20:25:36.3846504Z 2025-05-07T20:25:36.3846507Z 2025-05-07T20:25:36.3847919Z 2025-05-07T20:25:36.3902002Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 54%  2025-05-07T20:25:36.4056978Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:25:36.4057289Z 2025-05-07T20:25:36.4057295Z 2025-05-07T20:25:36.4057304Z 2025-05-07T20:25:36.4057416Z 2025-05-07T20:25:36.4057423Z 2025-05-07T20:25:36.4057439Z 2025-05-07T20:25:36.4706367Z libcusolver-11.7.1.2 | 95.8 MB | ###2 | 32%  2025-05-07T20:25:36.4706678Z 2025-05-07T20:25:36.4706682Z 2025-05-07T20:25:36.4910918Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:36.4974499Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:36.4974773Z 2025-05-07T20:25:36.4974777Z 2025-05-07T20:25:36.4974780Z 2025-05-07T20:25:36.4974784Z 2025-05-07T20:25:36.4974788Z 2025-05-07T20:25:36.5191562Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 57%  2025-05-07T20:25:36.5191961Z 2025-05-07T20:25:36.5191965Z 2025-05-07T20:25:36.5191969Z 2025-05-07T20:25:36.5191972Z 2025-05-07T20:25:36.5191976Z 2025-05-07T20:25:36.5191980Z 2025-05-07T20:25:36.5343905Z libcusolver-11.7.1.2 | 95.8 MB | ###5 | 36%  2025-05-07T20:25:36.5344217Z 2025-05-07T20:25:36.5344221Z 2025-05-07T20:25:36.5344225Z 2025-05-07T20:25:36.5344228Z 2025-05-07T20:25:36.5344232Z 2025-05-07T20:25:36.5344252Z 2025-05-07T20:25:36.5347714Z 2025-05-07T20:25:36.6064998Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:36.6084216Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:36.6084517Z 2025-05-07T20:25:36.6084522Z 2025-05-07T20:25:36.6084525Z 2025-05-07T20:25:36.6084529Z 2025-05-07T20:25:36.6084533Z 2025-05-07T20:25:36.6345856Z cuda-nvvp-12.6.80 | 109.3 MB | #####9 | 60%  2025-05-07T20:25:36.6346156Z 2025-05-07T20:25:36.6346160Z 2025-05-07T20:25:36.6346164Z 2025-05-07T20:25:36.6346168Z 2025-05-07T20:25:36.6346171Z 2025-05-07T20:25:36.6346175Z 2025-05-07T20:25:36.6353131Z 2025-05-07T20:25:36.6356859Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:36.6357253Z 2025-05-07T20:25:36.6357257Z 2025-05-07T20:25:36.6357261Z 2025-05-07T20:25:36.6357264Z 2025-05-07T20:25:36.6357268Z 2025-05-07T20:25:36.6357271Z 2025-05-07T20:25:36.7111253Z libcusolver-11.7.1.2 | 95.8 MB | ###8 | 39%  2025-05-07T20:25:36.7111675Z 2025-05-07T20:25:36.7111680Z 2025-05-07T20:25:36.7111685Z 2025-05-07T20:25:36.7111691Z 2025-05-07T20:25:36.7114216Z 2025-05-07T20:25:36.7195864Z cuda-nvvp-12.6.80 | 109.3 MB | ######2 | 62%  2025-05-07T20:25:36.7350502Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.7350769Z 2025-05-07T20:25:36.7350773Z 2025-05-07T20:25:36.7350777Z 2025-05-07T20:25:36.7350780Z 2025-05-07T20:25:36.7350784Z 2025-05-07T20:25:36.7350788Z 2025-05-07T20:25:36.7352436Z 2025-05-07T20:25:36.7509352Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:25:36.7509756Z 2025-05-07T20:25:36.7509762Z 2025-05-07T20:25:36.7509767Z 2025-05-07T20:25:36.7509772Z 2025-05-07T20:25:36.7509777Z 2025-05-07T20:25:36.7512962Z 2025-05-07T20:25:36.8203930Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:25:36.8355911Z nsight-compute-2024. 
| 443.1 MB | ######9 | 70% 2025-05-07T20:25:36.8356179Z 2025-05-07T20:25:36.8356219Z 2025-05-07T20:25:36.8356224Z 2025-05-07T20:25:36.8356227Z 2025-05-07T20:25:36.8356231Z 2025-05-07T20:25:36.8356243Z 2025-05-07T20:25:36.8357848Z 2025-05-07T20:25:36.8402748Z libnpp-12.3.1.54 | 93.4 MB | 8 | 8%  2025-05-07T20:25:36.8403045Z 2025-05-07T20:25:36.8403049Z 2025-05-07T20:25:36.8403052Z 2025-05-07T20:25:36.8403056Z 2025-05-07T20:25:36.8403060Z 2025-05-07T20:25:36.8726062Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 65%  2025-05-07T20:25:36.8726381Z 2025-05-07T20:25:36.8726385Z 2025-05-07T20:25:36.8726388Z 2025-05-07T20:25:36.8726392Z 2025-05-07T20:25:36.8726396Z 2025-05-07T20:25:36.8726399Z 2025-05-07T20:25:36.9276745Z libcusolver-11.7.1.2 | 95.8 MB | ####4 | 45%  2025-05-07T20:25:36.9360711Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:36.9361035Z 2025-05-07T20:25:36.9361041Z 2025-05-07T20:25:36.9361068Z 2025-05-07T20:25:36.9361074Z 2025-05-07T20:25:36.9361079Z 2025-05-07T20:25:36.9361085Z 2025-05-07T20:25:36.9361091Z 2025-05-07T20:25:36.9527367Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:25:36.9528305Z 2025-05-07T20:25:36.9528310Z 2025-05-07T20:25:36.9528313Z 2025-05-07T20:25:36.9528317Z 2025-05-07T20:25:36.9530740Z 2025-05-07T20:25:36.9854347Z cuda-nvvp-12.6.80 | 109.3 MB | ######7 | 68%  2025-05-07T20:25:36.9854649Z 2025-05-07T20:25:36.9854653Z 2025-05-07T20:25:36.9854656Z 2025-05-07T20:25:36.9854660Z 2025-05-07T20:25:36.9854664Z 2025-05-07T20:25:36.9856721Z 2025-05-07T20:25:37.0356631Z libcusolver-11.7.1.2 | 95.8 MB | ####7 | 47%  2025-05-07T20:25:37.0364026Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:37.0364324Z 2025-05-07T20:25:37.0364329Z 2025-05-07T20:25:37.0364332Z 2025-05-07T20:25:37.0364336Z 2025-05-07T20:25:37.0364339Z 2025-05-07T20:25:37.0364369Z 2025-05-07T20:25:37.0364373Z 2025-05-07T20:25:37.0528719Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:25:37.0529113Z 2025-05-07T20:25:37.0529132Z 2025-05-07T20:25:37.0529137Z 2025-05-07T20:25:37.0529142Z 2025-05-07T20:25:37.0529147Z 2025-05-07T20:25:37.0855847Z cuda-nvvp-12.6.80 | 109.3 MB | ####### | 70%  2025-05-07T20:25:37.0856291Z 2025-05-07T20:25:37.0856297Z 2025-05-07T20:25:37.0856302Z 2025-05-07T20:25:37.0856307Z 2025-05-07T20:25:37.0856312Z 2025-05-07T20:25:37.0856318Z 2025-05-07T20:25:37.1369514Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 50%  2025-05-07T20:25:37.1369826Z 2025-05-07T20:25:37.1369830Z 2025-05-07T20:25:37.1369834Z 2025-05-07T20:25:37.1369838Z 2025-05-07T20:25:37.1369841Z 2025-05-07T20:25:37.1369845Z 2025-05-07T20:25:37.1373975Z 2025-05-07T20:25:37.1559685Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:25:37.1560102Z 2025-05-07T20:25:37.1560387Z 2025-05-07T20:25:37.1560407Z 2025-05-07T20:25:37.1560412Z 2025-05-07T20:25:37.1563491Z 2025-05-07T20:25:37.1576177Z cuda-nvvp-12.6.80 | 109.3 MB | #######2 | 73%  2025-05-07T20:25:37.1917057Z nsight-compute-2024. 
| 443.1 MB | #######1 | 72% 2025-05-07T20:25:37.1917364Z 2025-05-07T20:25:37.1917368Z 2025-05-07T20:25:37.1917372Z 2025-05-07T20:25:37.1917375Z 2025-05-07T20:25:37.1917379Z 2025-05-07T20:25:37.1920103Z 2025-05-07T20:25:37.2393976Z libcusolver-11.7.1.2 | 95.8 MB | #####2 | 53%  2025-05-07T20:25:37.2394288Z 2025-05-07T20:25:37.2394292Z 2025-05-07T20:25:37.2394296Z 2025-05-07T20:25:37.2394299Z 2025-05-07T20:25:37.2394303Z 2025-05-07T20:25:37.2394307Z 2025-05-07T20:25:37.2397890Z 2025-05-07T20:25:37.2633829Z libnpp-12.3.1.54 | 93.4 MB | #9 | 20%  2025-05-07T20:25:37.2634179Z 2025-05-07T20:25:37.2634183Z 2025-05-07T20:25:37.2634186Z 2025-05-07T20:25:37.2634190Z 2025-05-07T20:25:37.2635760Z 2025-05-07T20:25:37.2728901Z cuda-nvvp-12.6.80 | 109.3 MB | #######5 | 75%  2025-05-07T20:25:37.2941005Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:37.2941334Z 2025-05-07T20:25:37.2941340Z 2025-05-07T20:25:37.2941343Z 2025-05-07T20:25:37.2941434Z 2025-05-07T20:25:37.2941437Z 2025-05-07T20:25:37.2941464Z 2025-05-07T20:25:37.3472995Z libcusolver-11.7.1.2 | 95.8 MB | #####5 | 55%  2025-05-07T20:25:37.3473432Z 2025-05-07T20:25:37.3473437Z 2025-05-07T20:25:37.3473442Z 2025-05-07T20:25:37.3473447Z 2025-05-07T20:25:37.3473452Z 2025-05-07T20:25:37.3473458Z 2025-05-07T20:25:37.3475278Z 2025-05-07T20:25:37.3692711Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:25:37.3693007Z 2025-05-07T20:25:37.3693011Z 2025-05-07T20:25:37.3693014Z 2025-05-07T20:25:37.3693018Z 2025-05-07T20:25:37.3699701Z 2025-05-07T20:25:37.3812834Z cuda-nvvp-12.6.80 | 109.3 MB | #######7 | 78%  2025-05-07T20:25:37.3948623Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:25:37.3948930Z 2025-05-07T20:25:37.3948934Z 2025-05-07T20:25:37.3948937Z 2025-05-07T20:25:37.3948953Z 2025-05-07T20:25:37.3948957Z 2025-05-07T20:25:37.3950773Z 2025-05-07T20:25:37.4473774Z libcusolver-11.7.1.2 | 95.8 MB | #####8 | 58%  2025-05-07T20:25:37.4474091Z 2025-05-07T20:25:37.4474095Z 2025-05-07T20:25:37.4474099Z 2025-05-07T20:25:37.4474103Z 2025-05-07T20:25:37.4474106Z 2025-05-07T20:25:37.4474110Z 2025-05-07T20:25:37.4474114Z 2025-05-07T20:25:37.4818241Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:25:37.4832214Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:37.4832484Z 2025-05-07T20:25:37.4832490Z 2025-05-07T20:25:37.4832495Z 2025-05-07T20:25:37.4832500Z 2025-05-07T20:25:37.4836811Z 2025-05-07T20:25:37.4950744Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 80%  2025-05-07T20:25:37.4951109Z 2025-05-07T20:25:37.4951116Z 2025-05-07T20:25:37.4951122Z 2025-05-07T20:25:37.4951128Z 2025-05-07T20:25:37.4951135Z 2025-05-07T20:25:37.4954063Z 2025-05-07T20:25:37.5511869Z libcusolver-11.7.1.2 | 95.8 MB | ###### | 61%  2025-05-07T20:25:37.5512183Z 2025-05-07T20:25:37.5512187Z 2025-05-07T20:25:37.5512191Z 2025-05-07T20:25:37.5512194Z 2025-05-07T20:25:37.5512198Z 2025-05-07T20:25:37.5512201Z 2025-05-07T20:25:37.5514259Z 2025-05-07T20:25:37.5871029Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:25:37.5885158Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:37.5885526Z 2025-05-07T20:25:37.5885532Z 2025-05-07T20:25:37.5885537Z 2025-05-07T20:25:37.5885542Z 2025-05-07T20:25:37.5887810Z 2025-05-07T20:25:37.5986621Z cuda-nvvp-12.6.80 | 109.3 MB | ########2 | 82%  2025-05-07T20:25:37.5986931Z 2025-05-07T20:25:37.5986936Z 2025-05-07T20:25:37.5986941Z 2025-05-07T20:25:37.5987227Z 2025-05-07T20:25:37.5987235Z 2025-05-07T20:25:37.5987239Z 2025-05-07T20:25:37.6515328Z libcusolver-11.7.1.2 | 95.8 MB | ######3 | 64%  2025-05-07T20:25:37.6516062Z 2025-05-07T20:25:37.6516068Z 2025-05-07T20:25:37.6516073Z 2025-05-07T20:25:37.6516079Z 2025-05-07T20:25:37.6516084Z 2025-05-07T20:25:37.6516089Z 2025-05-07T20:25:37.6519512Z 2025-05-07T20:25:37.6946813Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 32%  2025-05-07T20:25:37.6947150Z 2025-05-07T20:25:37.6947154Z 2025-05-07T20:25:37.6947158Z 2025-05-07T20:25:37.6947162Z 2025-05-07T20:25:37.6952936Z 2025-05-07T20:25:37.6982927Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 84%  2025-05-07T20:25:37.7009806Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:37.7010067Z 2025-05-07T20:25:37.7010071Z 2025-05-07T20:25:37.7010074Z 2025-05-07T20:25:37.7010078Z 2025-05-07T20:25:37.7010082Z 2025-05-07T20:25:37.7012008Z 2025-05-07T20:25:37.7517939Z libcusolver-11.7.1.2 | 95.8 MB | ######6 | 66%  2025-05-07T20:25:37.7518243Z 2025-05-07T20:25:37.7518247Z 2025-05-07T20:25:37.7518251Z 2025-05-07T20:25:37.7518264Z 2025-05-07T20:25:37.7518268Z 2025-05-07T20:25:37.7518272Z 2025-05-07T20:25:37.7522466Z 2025-05-07T20:25:37.7949404Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:25:37.7949686Z 2025-05-07T20:25:37.7949690Z 2025-05-07T20:25:37.7949694Z 2025-05-07T20:25:37.7949698Z 2025-05-07T20:25:37.7953377Z 2025-05-07T20:25:37.7997426Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 87%  2025-05-07T20:25:37.8013142Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:25:37.8013401Z 2025-05-07T20:25:37.8013405Z 2025-05-07T20:25:37.8013408Z 2025-05-07T20:25:37.8013412Z 2025-05-07T20:25:37.8013415Z 2025-05-07T20:25:37.8017473Z 2025-05-07T20:25:37.8736625Z libcusolver-11.7.1.2 | 95.8 MB | ######9 | 69%  2025-05-07T20:25:37.8736953Z 2025-05-07T20:25:37.8736957Z 2025-05-07T20:25:37.8736961Z 2025-05-07T20:25:37.8736964Z 2025-05-07T20:25:37.8736968Z 2025-05-07T20:25:37.8736972Z 2025-05-07T20:25:37.8742132Z 2025-05-07T20:25:37.8951495Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:25:37.8951803Z 2025-05-07T20:25:37.8951807Z 2025-05-07T20:25:37.8951810Z 2025-05-07T20:25:37.8951814Z 2025-05-07T20:25:37.8954361Z 2025-05-07T20:25:37.9015636Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:37.9015920Z 2025-05-07T20:25:37.9015924Z 2025-05-07T20:25:37.9015928Z 2025-05-07T20:25:37.9015931Z 2025-05-07T20:25:37.9015935Z 2025-05-07T20:25:37.9017770Z 2025-05-07T20:25:37.9025615Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 72%  2025-05-07T20:25:37.9797464Z nsight-compute-2024. 
| 443.1 MB | #######5 | 76% 2025-05-07T20:25:37.9797736Z 2025-05-07T20:25:37.9797740Z 2025-05-07T20:25:37.9797743Z 2025-05-07T20:25:37.9797763Z 2025-05-07T20:25:37.9797767Z 2025-05-07T20:25:37.9797771Z 2025-05-07T20:25:37.9797774Z 2025-05-07T20:25:37.9956411Z libnpp-12.3.1.54 | 93.4 MB | #### | 40%  2025-05-07T20:25:37.9956714Z 2025-05-07T20:25:37.9956718Z 2025-05-07T20:25:37.9956722Z 2025-05-07T20:25:37.9956726Z 2025-05-07T20:25:37.9959423Z 2025-05-07T20:25:38.0031801Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 91%  2025-05-07T20:25:38.0032230Z 2025-05-07T20:25:38.0032236Z 2025-05-07T20:25:38.0032242Z 2025-05-07T20:25:38.0032247Z 2025-05-07T20:25:38.0032252Z 2025-05-07T20:25:38.0032258Z 2025-05-07T20:25:38.0134133Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 74%  2025-05-07T20:25:38.0801156Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:38.0801586Z 2025-05-07T20:25:38.0801593Z 2025-05-07T20:25:38.0801598Z 2025-05-07T20:25:38.0801603Z 2025-05-07T20:25:38.0801608Z 2025-05-07T20:25:38.0801613Z 2025-05-07T20:25:38.0801618Z 2025-05-07T20:25:38.1034777Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 43%  2025-05-07T20:25:38.1035086Z 2025-05-07T20:25:38.1035089Z 2025-05-07T20:25:38.1035093Z 2025-05-07T20:25:38.1035628Z 2025-05-07T20:25:38.1035634Z 2025-05-07T20:25:38.1038942Z 2025-05-07T20:25:38.1049390Z libcusolver-11.7.1.2 | 95.8 MB | #######7 | 77%  2025-05-07T20:25:38.1049693Z 2025-05-07T20:25:38.1049697Z 2025-05-07T20:25:38.1049700Z 2025-05-07T20:25:38.1049704Z 2025-05-07T20:25:38.1055923Z 2025-05-07T20:25:38.1231988Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:25:38.1889110Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:25:38.1889446Z 2025-05-07T20:25:38.1889452Z 2025-05-07T20:25:38.1889457Z 2025-05-07T20:25:38.1889462Z 2025-05-07T20:25:38.1889468Z 2025-05-07T20:25:38.1889473Z 2025-05-07T20:25:38.1889478Z 2025-05-07T20:25:38.2090330Z libnpp-12.3.1.54 | 93.4 MB | ####5 | 46%  2025-05-07T20:25:38.2090617Z 2025-05-07T20:25:38.2090621Z 2025-05-07T20:25:38.2090625Z 2025-05-07T20:25:38.2090628Z 2025-05-07T20:25:38.2091442Z 2025-05-07T20:25:38.2107381Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 96%  2025-05-07T20:25:38.2107667Z 2025-05-07T20:25:38.2107670Z 2025-05-07T20:25:38.2107674Z 2025-05-07T20:25:38.2116818Z 2025-05-07T20:25:38.2127063Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:38.2127340Z 2025-05-07T20:25:38.2127344Z 2025-05-07T20:25:38.2127347Z 2025-05-07T20:25:38.2127351Z 2025-05-07T20:25:38.2127355Z 2025-05-07T20:25:38.2128660Z 2025-05-07T20:25:38.2282031Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 80%  2025-05-07T20:25:38.2966573Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:38.2966940Z 2025-05-07T20:25:38.2966946Z 2025-05-07T20:25:38.2966961Z 2025-05-07T20:25:38.2966966Z 2025-05-07T20:25:38.2966972Z 2025-05-07T20:25:38.2966977Z 2025-05-07T20:25:38.2967000Z 2025-05-07T20:25:38.3095665Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:25:38.3096062Z 2025-05-07T20:25:38.3096067Z 2025-05-07T20:25:38.3096086Z 2025-05-07T20:25:38.3096091Z 2025-05-07T20:25:38.3096097Z 2025-05-07T20:25:38.3131342Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 99%  2025-05-07T20:25:38.3131716Z 2025-05-07T20:25:38.3131720Z 2025-05-07T20:25:38.3131724Z 2025-05-07T20:25:38.3131727Z 2025-05-07T20:25:38.3131731Z 2025-05-07T20:25:38.3132716Z 2025-05-07T20:25:38.3285548Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 83%  2025-05-07T20:25:38.3967370Z nsight-compute-2024. 
| 443.1 MB | #######8 | 78% 2025-05-07T20:25:38.3967697Z 2025-05-07T20:25:38.3967703Z 2025-05-07T20:25:38.3967708Z 2025-05-07T20:25:38.3967722Z 2025-05-07T20:25:38.3967727Z 2025-05-07T20:25:38.3967732Z 2025-05-07T20:25:38.3967737Z 2025-05-07T20:25:38.4131453Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 52%  2025-05-07T20:25:38.4131813Z 2025-05-07T20:25:38.4131825Z 2025-05-07T20:25:38.4131829Z 2025-05-07T20:25:38.4131833Z 2025-05-07T20:25:38.4131836Z 2025-05-07T20:25:38.4133296Z 2025-05-07T20:25:38.4288666Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:38.4968436Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:25:38.4968827Z 2025-05-07T20:25:38.4968833Z 2025-05-07T20:25:38.4968838Z 2025-05-07T20:25:38.4968853Z 2025-05-07T20:25:38.4968858Z 2025-05-07T20:25:38.4968863Z 2025-05-07T20:25:38.4971725Z 2025-05-07T20:25:38.5135423Z libnpp-12.3.1.54 | 93.4 MB | #####5 | 55%  2025-05-07T20:25:38.5135725Z 2025-05-07T20:25:38.5135729Z 2025-05-07T20:25:38.5135733Z 2025-05-07T20:25:38.5135736Z 2025-05-07T20:25:38.5135740Z 2025-05-07T20:25:38.5135744Z 2025-05-07T20:25:38.5297269Z libcusolver-11.7.1.2 | 95.8 MB | ########9 | 89%  2025-05-07T20:25:38.5977252Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:38.5977668Z 2025-05-07T20:25:38.5977674Z 2025-05-07T20:25:38.5977679Z 2025-05-07T20:25:38.5977684Z 2025-05-07T20:25:38.5977698Z 2025-05-07T20:25:38.5977857Z 2025-05-07T20:25:38.5980343Z 2025-05-07T20:25:38.6136294Z libnpp-12.3.1.54 | 93.4 MB | #####8 | 59%  2025-05-07T20:25:38.6136648Z 2025-05-07T20:25:38.6136662Z 2025-05-07T20:25:38.6136668Z 2025-05-07T20:25:38.6136673Z 2025-05-07T20:25:38.6136678Z 2025-05-07T20:25:38.6140282Z 2025-05-07T20:25:38.6299991Z libcusolver-11.7.1.2 | 95.8 MB | #########2 | 93%  2025-05-07T20:25:38.6977585Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:38.6977912Z 2025-05-07T20:25:38.6977916Z 2025-05-07T20:25:38.6977919Z 2025-05-07T20:25:38.6977923Z 2025-05-07T20:25:38.6977927Z 2025-05-07T20:25:38.6977940Z 2025-05-07T20:25:38.6977944Z 2025-05-07T20:25:38.7146777Z libnpp-12.3.1.54 | 93.4 MB | ######2 | 63%  2025-05-07T20:25:38.7147139Z 2025-05-07T20:25:38.7147143Z 2025-05-07T20:25:38.7147146Z 2025-05-07T20:25:38.7147157Z 2025-05-07T20:25:38.7147161Z 2025-05-07T20:25:38.7147801Z 2025-05-07T20:25:38.7304919Z libcusolver-11.7.1.2 | 95.8 MB | #########6 | 96%  2025-05-07T20:25:38.8064674Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:38.8065007Z 2025-05-07T20:25:38.8065011Z 2025-05-07T20:25:38.8065015Z 2025-05-07T20:25:38.8065018Z 2025-05-07T20:25:38.8065022Z 2025-05-07T20:25:38.8065025Z 2025-05-07T20:25:38.8068138Z 2025-05-07T20:25:38.8149583Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 66%  2025-05-07T20:25:38.8149877Z 2025-05-07T20:25:38.8149881Z 2025-05-07T20:25:38.8149884Z 2025-05-07T20:25:38.8149888Z 2025-05-07T20:25:38.8149899Z 2025-05-07T20:25:38.8150642Z 2025-05-07T20:25:38.8305464Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 100%  2025-05-07T20:25:38.9066977Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:25:38.9067331Z 2025-05-07T20:25:38.9067337Z 2025-05-07T20:25:38.9067342Z 2025-05-07T20:25:38.9067347Z 2025-05-07T20:25:38.9067352Z 2025-05-07T20:25:38.9067370Z 2025-05-07T20:25:38.9070497Z 2025-05-07T20:25:38.9306390Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:25:39.0121377Z nsight-compute-2024. 
| 443.1 MB | ########2 | 82% 2025-05-07T20:25:39.0121714Z 2025-05-07T20:25:39.0121720Z 2025-05-07T20:25:39.0121725Z 2025-05-07T20:25:39.0121730Z 2025-05-07T20:25:39.0121735Z 2025-05-07T20:25:39.0121741Z 2025-05-07T20:25:39.0126222Z 2025-05-07T20:25:39.0312883Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:25:39.1167893Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:25:39.1168212Z 2025-05-07T20:25:39.1168216Z 2025-05-07T20:25:39.1168220Z 2025-05-07T20:25:39.1168224Z 2025-05-07T20:25:39.1168227Z 2025-05-07T20:25:39.1168231Z 2025-05-07T20:25:39.1168261Z 2025-05-07T20:25:39.1418304Z libnpp-12.3.1.54 | 93.4 MB | #######5 | 76%  2025-05-07T20:25:39.2170042Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:39.2170345Z 2025-05-07T20:25:39.2170349Z 2025-05-07T20:25:39.2170353Z 2025-05-07T20:25:39.2170356Z 2025-05-07T20:25:39.2170360Z 2025-05-07T20:25:39.2170364Z 2025-05-07T20:25:39.2170368Z 2025-05-07T20:25:39.2445317Z libnpp-12.3.1.54 | 93.4 MB | #######8 | 79%  2025-05-07T20:25:39.3175997Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:25:39.3176276Z 2025-05-07T20:25:39.3176281Z 2025-05-07T20:25:39.3176284Z 2025-05-07T20:25:39.3176288Z 2025-05-07T20:25:39.3176292Z 2025-05-07T20:25:39.3176296Z 2025-05-07T20:25:39.3176300Z 2025-05-07T20:25:39.3451582Z libnpp-12.3.1.54 | 93.4 MB | ########1 | 82%  2025-05-07T20:25:39.4451116Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:39.4870936Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:25:39.4871555Z 2025-05-07T20:25:39.4871559Z 2025-05-07T20:25:39.4871563Z 2025-05-07T20:25:39.4871567Z 2025-05-07T20:25:39.4871779Z 2025-05-07T20:25:39.4871783Z 2025-05-07T20:25:39.4872966Z 2025-05-07T20:25:39.5456654Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 85%  2025-05-07T20:25:39.5876432Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:39.5876804Z 2025-05-07T20:25:39.5876808Z 2025-05-07T20:25:39.5876818Z 2025-05-07T20:25:39.5876822Z 2025-05-07T20:25:39.5876825Z 2025-05-07T20:25:39.5876829Z 2025-05-07T20:25:39.5876833Z 2025-05-07T20:25:39.6521491Z libnpp-12.3.1.54 | 93.4 MB | ########8 | 88%  2025-05-07T20:25:39.6890713Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:25:39.6891177Z 2025-05-07T20:25:39.6891435Z 2025-05-07T20:25:39.6891441Z 2025-05-07T20:25:39.6891447Z 2025-05-07T20:25:39.6891470Z 2025-05-07T20:25:39.6891487Z 2025-05-07T20:25:39.6891671Z 2025-05-07T20:25:39.7557190Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:25:39.7892443Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:25:39.7892725Z 2025-05-07T20:25:39.7892729Z 2025-05-07T20:25:39.7892733Z 2025-05-07T20:25:39.7892737Z 2025-05-07T20:25:39.7892740Z 2025-05-07T20:25:39.7892744Z 2025-05-07T20:25:39.7894163Z 2025-05-07T20:25:39.8558368Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 95%  2025-05-07T20:25:39.8894108Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:25:39.8894421Z 2025-05-07T20:25:39.8894555Z 2025-05-07T20:25:39.8894561Z 2025-05-07T20:25:39.8894566Z 2025-05-07T20:25:39.8894571Z 2025-05-07T20:25:39.8894577Z 2025-05-07T20:25:39.8894604Z 2025-05-07T20:25:39.9560731Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 99%  2025-05-07T20:25:40.0560682Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.1562236Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:40.2564375Z nsight-compute-2024. | 443.1 MB | #########1 | 91% 2025-05-07T20:25:40.3568748Z nsight-compute-2024. 
| 443.1 MB | #########2 | 92% 2025-05-07T20:25:40.4573909Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:25:40.5575362Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:40.6609152Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:40.7628716Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:25:40.8661387Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:25:40.9663007Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:41.0760551Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:41.6693414Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:25:41.6693810Z 2025-05-07T20:25:41.6693816Z 2025-05-07T20:25:41.6693821Z 2025-05-07T20:25:41.6693826Z 2025-05-07T20:25:41.6693850Z 2025-05-07T20:25:41.6693856Z 2025-05-07T20:25:41.7373114Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:41.7373417Z 2025-05-07T20:25:41.7373436Z 2025-05-07T20:25:41.7373439Z 2025-05-07T20:25:41.7373443Z 2025-05-07T20:25:41.7373446Z 2025-05-07T20:25:41.7373450Z 2025-05-07T20:25:41.7373454Z 2025-05-07T20:25:41.7374932Z 2025-05-07T20:25:41.7902392Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:41.7902712Z 2025-05-07T20:25:41.7902716Z 2025-05-07T20:25:41.7902720Z 2025-05-07T20:25:41.7902723Z 2025-05-07T20:25:41.7902727Z 2025-05-07T20:25:41.8373183Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.8373578Z 2025-05-07T20:25:41.8373583Z 2025-05-07T20:25:41.8373589Z 2025-05-07T20:25:41.8373594Z 2025-05-07T20:25:41.8373600Z 2025-05-07T20:25:41.8373605Z 2025-05-07T20:25:41.8373626Z 2025-05-07T20:25:41.8373631Z 2025-05-07T20:25:41.8534062Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:25:41.8534366Z 2025-05-07T20:25:41.8534369Z 2025-05-07T20:25:41.8534380Z 2025-05-07T20:25:41.8534383Z 2025-05-07T20:25:41.8534521Z 2025-05-07T20:25:41.8534525Z 2025-05-07T20:25:41.8534528Z 2025-05-07T20:25:41.8534532Z 2025-05-07T20:25:41.8535866Z 2025-05-07T20:25:41.9393644Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:25:41.9393975Z 2025-05-07T20:25:41.9393979Z 2025-05-07T20:25:41.9393983Z 2025-05-07T20:25:41.9393986Z 2025-05-07T20:25:41.9393990Z 2025-05-07T20:25:41.9393994Z 2025-05-07T20:25:41.9393998Z 2025-05-07T20:25:41.9394001Z 2025-05-07T20:25:41.9537111Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 13%  2025-05-07T20:25:41.9537420Z 2025-05-07T20:25:41.9537424Z 2025-05-07T20:25:41.9537427Z 2025-05-07T20:25:41.9537431Z 2025-05-07T20:25:41.9537435Z 2025-05-07T20:25:41.9537439Z 2025-05-07T20:25:41.9537449Z 2025-05-07T20:25:41.9537466Z 2025-05-07T20:25:41.9538920Z 2025-05-07T20:25:42.0395239Z libcurand-10.3.7.77 | 39.9 MB | 7 | 8%  2025-05-07T20:25:42.0395552Z 2025-05-07T20:25:42.0395568Z 2025-05-07T20:25:42.0395574Z 2025-05-07T20:25:42.0395579Z 2025-05-07T20:25:42.0395584Z 2025-05-07T20:25:42.0395590Z 2025-05-07T20:25:42.0395595Z 2025-05-07T20:25:42.0400916Z 2025-05-07T20:25:42.0586935Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:25:42.0587380Z 2025-05-07T20:25:42.0587386Z 2025-05-07T20:25:42.0587391Z 2025-05-07T20:25:42.0587396Z 2025-05-07T20:25:42.0587401Z 2025-05-07T20:25:42.0587406Z 2025-05-07T20:25:42.0587411Z 2025-05-07T20:25:42.0587416Z 2025-05-07T20:25:42.0589564Z 2025-05-07T20:25:42.1427434Z libcurand-10.3.7.77 | 39.9 MB | #5 | 15%  2025-05-07T20:25:42.1427750Z 2025-05-07T20:25:42.1427754Z 2025-05-07T20:25:42.1427757Z 2025-05-07T20:25:42.1427761Z 2025-05-07T20:25:42.1427776Z 2025-05-07T20:25:42.1427780Z 2025-05-07T20:25:42.1427784Z 
2025-05-07T20:25:42.9590498Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:43.8937994Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.0015753Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:44.5301098Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:44.6708267Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:44.9517184Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:45.3770942Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:45.4316622Z python-3.10.13 | 24.5 MB | ########## | 100%
2025-05-07T20:25:46.0900471Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:46.0940252Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:46.2488385Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:46.4489430Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:46.7532511Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:46.7601322Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:48.1054351Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:49.0332937Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:49.3398915Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:49.7453936Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:51.6051086Z ... (more hidden) ...
2025-05-07T20:25:57.8053244Z 2025-05-07T20:25:57.8053417Z  2025-05-07T20:25:57.8053589Z 2025-05-07T20:25:57.8053595Z 2025-05-07T20:25:57.8053601Z 2025-05-07T20:25:57.8053783Z  2025-05-07T20:25:57.8053953Z 2025-05-07T20:25:57.8053959Z 2025-05-07T20:25:57.8053964Z 2025-05-07T20:25:57.8053970Z 2025-05-07T20:25:57.8054153Z  2025-05-07T20:25:57.8054354Z 2025-05-07T20:25:57.8054360Z 2025-05-07T20:25:57.8054366Z 2025-05-07T20:25:57.8054371Z 2025-05-07T20:25:57.8054395Z 2025-05-07T20:25:57.8054580Z  2025-05-07T20:25:57.8054801Z 2025-05-07T20:25:57.8054807Z 2025-05-07T20:25:57.8054812Z 2025-05-07T20:25:57.8054817Z 2025-05-07T20:25:57.8054822Z 2025-05-07T20:25:57.8054828Z 2025-05-07T20:25:57.8055014Z  2025-05-07T20:25:57.8055238Z 2025-05-07T20:25:57.8055244Z 2025-05-07T20:25:57.8055249Z 2025-05-07T20:25:57.8055254Z 2025-05-07T20:25:57.8055259Z 2025-05-07T20:25:57.8055264Z 2025-05-07T20:25:57.8055269Z 2025-05-07T20:25:57.8055460Z  2025-05-07T20:25:57.8055937Z 2025-05-07T20:25:57.8055943Z 2025-05-07T20:25:57.8055948Z 2025-05-07T20:25:57.8055953Z 2025-05-07T20:25:57.8055958Z 2025-05-07T20:25:57.8055963Z 2025-05-07T20:25:57.8055968Z 2025-05-07T20:25:57.8055973Z 2025-05-07T20:25:57.8056187Z  2025-05-07T20:25:57.8056484Z 2025-05-07T20:25:57.8056489Z 2025-05-07T20:25:57.8056716Z 2025-05-07T20:25:57.8056724Z 2025-05-07T20:25:57.8056730Z 2025-05-07T20:25:57.8056736Z 2025-05-07T20:25:57.8056741Z 2025-05-07T20:25:57.8056747Z 2025-05-07T20:25:57.8056891Z 2025-05-07T20:25:57.8057119Z  2025-05-07T20:25:57.8057403Z 2025-05-07T20:25:57.8057409Z 2025-05-07T20:25:57.8057415Z 2025-05-07T20:25:57.8057421Z 2025-05-07T20:25:57.8057427Z 2025-05-07T20:25:57.8057432Z 2025-05-07T20:25:57.8057438Z 2025-05-07T20:25:57.8057443Z 2025-05-07T20:25:57.8057448Z 2025-05-07T20:25:57.8057453Z 2025-05-07T20:25:57.8057676Z  2025-05-07T20:25:57.8057950Z 2025-05-07T20:25:57.8057956Z 2025-05-07T20:25:57.8057961Z 2025-05-07T20:25:57.8057966Z 2025-05-07T20:25:57.8057972Z 2025-05-07T20:25:57.8057977Z 2025-05-07T20:25:57.8057982Z 2025-05-07T20:25:57.8057987Z 2025-05-07T20:25:57.8057992Z 2025-05-07T20:25:57.8057998Z 2025-05-07T20:25:57.8058004Z 2025-05-07T20:25:57.8058394Z  2025-05-07T20:25:57.8058685Z 2025-05-07T20:25:57.8058691Z 2025-05-07T20:25:57.8058697Z 2025-05-07T20:25:57.8058714Z 2025-05-07T20:25:57.8058720Z 2025-05-07T20:25:57.8058726Z 2025-05-07T20:25:57.8058739Z 2025-05-07T20:25:57.8058745Z 2025-05-07T20:25:57.8058751Z 2025-05-07T20:25:57.8058756Z 2025-05-07T20:25:57.8058762Z 2025-05-07T20:25:57.8058768Z 2025-05-07T20:25:57.8058988Z  2025-05-07T20:25:57.8059304Z 2025-05-07T20:25:57.8059309Z 2025-05-07T20:25:57.8059314Z 2025-05-07T20:25:57.8059320Z 2025-05-07T20:25:57.8059325Z 2025-05-07T20:25:57.8059330Z 2025-05-07T20:25:57.8059335Z 2025-05-07T20:25:57.8059340Z 2025-05-07T20:25:57.8059345Z 2025-05-07T20:25:57.8059351Z 2025-05-07T20:25:57.8059356Z 2025-05-07T20:25:57.8059361Z 2025-05-07T20:25:57.8059367Z 2025-05-07T20:25:57.8059589Z  2025-05-07T20:25:57.8059910Z 2025-05-07T20:25:57.8059916Z 2025-05-07T20:25:57.8059922Z 2025-05-07T20:25:57.8059927Z 2025-05-07T20:25:57.8059941Z 2025-05-07T20:25:57.8059946Z 2025-05-07T20:25:57.8059951Z 2025-05-07T20:25:57.8059957Z 2025-05-07T20:25:57.8059963Z 2025-05-07T20:25:57.8059969Z 2025-05-07T20:25:57.8059974Z 2025-05-07T20:25:57.8059983Z 2025-05-07T20:25:57.8059988Z 2025-05-07T20:25:57.8059994Z 2025-05-07T20:25:57.8060238Z  2025-05-07T20:25:57.8060543Z 2025-05-07T20:25:57.8060549Z 2025-05-07T20:25:57.8060554Z 2025-05-07T20:25:57.8060560Z 2025-05-07T20:25:57.8060565Z 2025-05-07T20:25:57.8060570Z 
2025-05-07T20:25:57.8060576Z 2025-05-07T20:25:57.8060581Z 2025-05-07T20:25:57.8060587Z 2025-05-07T20:25:57.8060592Z 2025-05-07T20:25:57.8060607Z 2025-05-07T20:25:57.8060612Z 2025-05-07T20:25:57.8060618Z 2025-05-07T20:25:57.8060624Z 2025-05-07T20:25:57.8060629Z 2025-05-07T20:25:57.8060879Z  2025-05-07T20:25:57.8061197Z 2025-05-07T20:25:57.8061203Z 2025-05-07T20:25:57.8061209Z 2025-05-07T20:25:57.8061223Z 2025-05-07T20:25:57.8061229Z 2025-05-07T20:25:57.8061243Z 2025-05-07T20:25:57.8061249Z 2025-05-07T20:25:57.8061255Z 2025-05-07T20:25:57.8061261Z 2025-05-07T20:25:57.8061267Z 2025-05-07T20:25:57.8061273Z 2025-05-07T20:25:57.8061284Z 2025-05-07T20:25:57.8061290Z 2025-05-07T20:25:57.8061296Z 2025-05-07T20:25:57.8061301Z 2025-05-07T20:25:57.8061307Z 2025-05-07T20:25:57.8061551Z  2025-05-07T20:25:57.8061907Z 2025-05-07T20:25:57.8061913Z 2025-05-07T20:25:57.8061919Z 2025-05-07T20:25:57.8061924Z 2025-05-07T20:25:57.8061930Z 2025-05-07T20:25:57.8061936Z 2025-05-07T20:25:57.8061942Z 2025-05-07T20:25:57.8061947Z 2025-05-07T20:25:57.8061952Z 2025-05-07T20:25:57.8061957Z 2025-05-07T20:25:57.8061963Z 2025-05-07T20:25:57.8061969Z 2025-05-07T20:25:57.8061974Z 2025-05-07T20:25:57.8061980Z 2025-05-07T20:25:57.8061985Z 2025-05-07T20:25:57.8061990Z 2025-05-07T20:25:57.8061995Z 2025-05-07T20:25:57.8062269Z  2025-05-07T20:25:57.8062746Z 2025-05-07T20:25:57.8062754Z 2025-05-07T20:25:57.8062759Z 2025-05-07T20:25:57.8062764Z 2025-05-07T20:25:57.8062769Z 2025-05-07T20:25:57.8062774Z 2025-05-07T20:25:57.8062779Z 2025-05-07T20:25:57.8062874Z 2025-05-07T20:25:57.8062891Z 2025-05-07T20:25:57.8062896Z 2025-05-07T20:25:57.8062901Z 2025-05-07T20:25:57.8062906Z 2025-05-07T20:25:57.8062911Z 2025-05-07T20:25:57.8062916Z 2025-05-07T20:25:57.8062921Z 2025-05-07T20:25:57.8062926Z 2025-05-07T20:25:57.8062932Z 2025-05-07T20:25:57.8062937Z 2025-05-07T20:25:57.8063262Z  2025-05-07T20:25:57.8063612Z 2025-05-07T20:25:57.8063618Z 2025-05-07T20:25:57.8063780Z  2025-05-07T20:25:57.8063968Z 2025-05-07T20:25:57.8063974Z 2025-05-07T20:25:57.8064138Z  2025-05-07T20:25:57.8064319Z 2025-05-07T20:25:57.8064325Z 2025-05-07T20:25:57.8064343Z 2025-05-07T20:25:57.8064524Z  2025-05-07T20:25:57.8064707Z 2025-05-07T20:25:57.8064713Z 2025-05-07T20:25:57.8064719Z 2025-05-07T20:25:57.8064734Z 2025-05-07T20:25:57.8064914Z  2025-05-07T20:25:57.8065111Z 2025-05-07T20:25:57.8065117Z 2025-05-07T20:25:57.8065123Z 2025-05-07T20:25:57.8065128Z 2025-05-07T20:25:57.8065142Z 2025-05-07T20:25:57.8065329Z  2025-05-07T20:25:57.8065538Z 2025-05-07T20:25:57.8065543Z 2025-05-07T20:25:57.8065548Z 2025-05-07T20:25:57.8065553Z 2025-05-07T20:25:57.8065559Z 2025-05-07T20:25:57.8065564Z 2025-05-07T20:25:57.8065758Z  2025-05-07T20:25:57.8065966Z 2025-05-07T20:25:57.8065972Z 2025-05-07T20:25:57.8065978Z 2025-05-07T20:25:57.8065983Z 2025-05-07T20:25:57.8065989Z 2025-05-07T20:25:57.8065994Z 2025-05-07T20:25:57.8066000Z 2025-05-07T20:25:57.8066197Z  2025-05-07T20:25:57.8066429Z 2025-05-07T20:25:57.8066436Z 2025-05-07T20:25:57.8066443Z 2025-05-07T20:25:57.8066448Z 2025-05-07T20:25:57.8066454Z 2025-05-07T20:25:57.8066460Z 2025-05-07T20:25:57.8066466Z 2025-05-07T20:25:57.8066471Z 2025-05-07T20:25:57.8066710Z  2025-05-07T20:25:57.8066980Z 2025-05-07T20:25:57.8066986Z 2025-05-07T20:25:57.8066991Z 2025-05-07T20:25:57.8066996Z 2025-05-07T20:25:57.8067001Z 2025-05-07T20:25:57.8067006Z 2025-05-07T20:25:57.8067018Z 2025-05-07T20:25:57.8067024Z 2025-05-07T20:25:57.8067029Z 2025-05-07T20:25:57.8067245Z  2025-05-07T20:25:57.8067495Z 2025-05-07T20:25:57.8067501Z 2025-05-07T20:25:57.8067507Z 
2025-05-07T20:25:57.8067513Z 2025-05-07T20:25:57.8067518Z 2025-05-07T20:25:57.8067524Z 2025-05-07T20:25:57.8067530Z 2025-05-07T20:25:57.8067536Z 2025-05-07T20:25:57.8067541Z 2025-05-07T20:25:57.8067591Z 2025-05-07T20:25:57.8067805Z  2025-05-07T20:25:57.8068085Z 2025-05-07T20:25:57.8068091Z 2025-05-07T20:25:57.8068097Z 2025-05-07T20:25:57.8068102Z 2025-05-07T20:25:57.8068108Z 2025-05-07T20:25:57.8068114Z 2025-05-07T20:25:57.8068119Z 2025-05-07T20:25:57.8068125Z 2025-05-07T20:25:57.8068130Z 2025-05-07T20:25:57.8068136Z 2025-05-07T20:25:57.8068142Z 2025-05-07T20:25:57.8068372Z  2025-05-07T20:25:57.8068667Z 2025-05-07T20:25:57.8068673Z 2025-05-07T20:25:57.8068678Z 2025-05-07T20:25:57.8068683Z 2025-05-07T20:25:57.8068695Z 2025-05-07T20:25:57.8068700Z 2025-05-07T20:25:57.8068705Z 2025-05-07T20:25:57.8068711Z 2025-05-07T20:25:57.8068716Z 2025-05-07T20:25:57.8068721Z 2025-05-07T20:25:57.8068726Z 2025-05-07T20:25:57.8068732Z 2025-05-07T20:25:57.8068976Z  2025-05-07T20:25:57.8069273Z 2025-05-07T20:25:57.8069279Z 2025-05-07T20:25:57.8069285Z 2025-05-07T20:25:57.8069291Z 2025-05-07T20:25:57.8069298Z 2025-05-07T20:25:57.8069303Z 2025-05-07T20:25:57.8069309Z 2025-05-07T20:25:57.8069315Z 2025-05-07T20:25:57.8069320Z 2025-05-07T20:25:57.8069326Z 2025-05-07T20:25:57.8069331Z 2025-05-07T20:25:57.8069336Z 2025-05-07T20:25:57.8069341Z 2025-05-07T20:25:57.8069579Z  2025-05-07T20:25:57.8069880Z 2025-05-07T20:25:57.8069885Z 2025-05-07T20:25:57.8070025Z 2025-05-07T20:25:57.8070032Z 2025-05-07T20:25:57.8070038Z 2025-05-07T20:25:57.8070043Z 2025-05-07T20:25:57.8070049Z 2025-05-07T20:25:57.8070054Z 2025-05-07T20:25:57.8070163Z 2025-05-07T20:25:57.8070169Z 2025-05-07T20:25:57.8070174Z 2025-05-07T20:25:57.8070179Z 2025-05-07T20:25:57.8070184Z 2025-05-07T20:25:57.8070189Z 2025-05-07T20:25:57.8070427Z  2025-05-07T20:25:57.8070719Z 2025-05-07T20:25:57.8070734Z 2025-05-07T20:25:57.8070740Z 2025-05-07T20:25:57.8070745Z 2025-05-07T20:25:57.8070750Z 2025-05-07T20:25:57.8070756Z 2025-05-07T20:25:57.8070762Z 2025-05-07T20:25:57.8070768Z 2025-05-07T20:25:57.8070773Z 2025-05-07T20:25:57.8070779Z 2025-05-07T20:25:57.8070784Z 2025-05-07T20:25:57.8070789Z 2025-05-07T20:25:57.8070795Z 2025-05-07T20:25:57.8070800Z 2025-05-07T20:25:57.8070806Z 2025-05-07T20:25:57.8071035Z  2025-05-07T20:25:57.8071358Z 2025-05-07T20:25:57.8071363Z 2025-05-07T20:25:57.8071377Z 2025-05-07T20:25:57.8071383Z 2025-05-07T20:25:57.8071389Z 2025-05-07T20:25:57.8071394Z 2025-05-07T20:25:57.8071400Z 2025-05-07T20:25:57.8071406Z 2025-05-07T20:25:57.8071412Z 2025-05-07T20:25:57.8071428Z 2025-05-07T20:25:57.8071434Z 2025-05-07T20:25:57.8071439Z 2025-05-07T20:25:57.8071445Z 2025-05-07T20:25:57.8071451Z 2025-05-07T20:25:57.8071457Z 2025-05-07T20:25:57.8071462Z 2025-05-07T20:25:57.8071707Z  2025-05-07T20:25:57.8072032Z 2025-05-07T20:25:57.8072037Z 2025-05-07T20:25:57.8072043Z 2025-05-07T20:25:57.8072048Z 2025-05-07T20:25:57.8072054Z 2025-05-07T20:25:57.8072059Z 2025-05-07T20:25:57.8072065Z 2025-05-07T20:25:57.8072070Z 2025-05-07T20:25:57.8072076Z 2025-05-07T20:25:57.8072090Z 2025-05-07T20:25:57.8072095Z 2025-05-07T20:25:57.8072100Z 2025-05-07T20:25:57.8072106Z 2025-05-07T20:25:57.8072111Z 2025-05-07T20:25:57.8072117Z 2025-05-07T20:25:57.8072122Z 2025-05-07T20:25:57.8072128Z 2025-05-07T20:25:57.8072373Z  2025-05-07T20:25:57.8072703Z 2025-05-07T20:25:57.8072709Z 2025-05-07T20:25:57.8072715Z 2025-05-07T20:25:57.8072720Z 2025-05-07T20:25:57.8072731Z 2025-05-07T20:25:57.8072737Z 2025-05-07T20:25:57.8072743Z 2025-05-07T20:25:57.8072748Z 2025-05-07T20:25:57.8072754Z 
2025-05-07T20:25:57.8077793Z  done
2025-05-07T20:25:58.1238064Z Preparing transaction: done
2025-05-07T20:25:59.5948034Z Verifying transaction: done
2025-05-07T20:26:00.3033021Z Executing transaction: done
2025-05-07T20:26:02.6563680Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:02.6564247Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:02.6565070Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:02.6579082Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:02.6592087Z [INSTALL] Copying nvtx3 headers ...
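(A note on the symlink step above: CUDA 12.x conda packages ship only the versioned libnvToolsExt.so.1, so the installer recreates the bare .so name that a `-lnvToolsExt` link line expects, in both library directories. A minimal standalone sketch of the same fix, assuming CONDA_PREFIX points at the build env; the loop and existence guard are illustrative, not from the CI script:)

    # Recreate the unversioned libnvToolsExt.so link in both library dirs,
    # mirroring the two ln -sf commands above; skips dirs without the .so.1.
    for libdir in "${CONDA_PREFIX}/lib" "${CONDA_PREFIX}/targets/x86_64-linux/lib"; do
      if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
        ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
      fi
    done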
2025-05-07T20:26:02.6597940Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:02.8330159Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:02.8352516Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.8717479Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:04.7492120Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.8114089Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:05.2335346Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:05.2677459Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:05.7001533Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
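(A note on the mechanism used above: `conda env config vars set` stores the variable inside the env itself, so it is exported on every later `conda run` or activation; that is why the earlier `printenv LD_LIBRARY_PATH` failed before the set, while the same probe succeeds for variables checked afterwards. A minimal verification sketch; the env name and value are taken from this log, and the `vars list` step is illustrative:)

    # Persist a variable in the conda env, then confirm it is recorded
    # and exported when the env is entered via conda run.
    conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
    conda env config vars list -n build_binary        # variables stored for the env
    conda run -n build_binary printenv NVML_LIB_PATH  # exported inside the env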
2025-05-07T20:26:05.7002463Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:08.1351778Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:10.1507221Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:12.1716371Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:12.1717203Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:14.1861377Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:16.0768148Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:16.1398510Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.9740165Z /tmp/tmpv72y_08w: line 3: clang: command not found
2025-05-07T20:26:19.9740973Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:20.0371720Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:20.0393092Z total 36
2025-05-07T20:26:20.0393385Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 .
2025-05-07T20:26:20.0393776Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:20.0394231Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:20.0394768Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:20.0395249Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:20.0395717Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:20.0396176Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:20.0396623Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:20.0397142Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:20.0397784Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:20.0420388Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.9910026Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.9911161Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:22.4135993Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:24.2916314Z -allow-unsupported-compiler
2025-05-07T20:26:24.3532404Z [INFO] Printing out all preprocessor defines in nvcc ...
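(The dump that follows comes from feeding an empty CUDA translation unit to nvcc with -dM -E, which prints every macro defined by the end of preprocessing: the host compiler's built-ins plus the CUDA headers nvcc pre-includes. Grepping that output is a quick way to confirm toolkit-version macros; a hedged sketch, with an illustrative filter pattern:)

    # Print all predefined macros for an empty .cu input, then keep only
    # the CUDA version macros that appear later in this dump.
    conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
      | grep -E '__CUDACC_VER|__CUDA_API_VER'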
2025-05-07T20:26:24.3532920Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:24.3533244Z 2025-05-07T20:26:26.3049198Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:26.3049962Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:26.3050377Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:26.3050714Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:26.3051082Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3051350Z #define _STL_PAIR_H 1 2025-05-07T20:26:26.3051643Z #define __cpp_attributes 200809L 2025-05-07T20:26:26.3051980Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:26.3052351Z #define __DELETE_THROW throw() 2025-05-07T20:26:26.3052610Z #define _PTRDIFF_T_ 2025-05-07T20:26:26.3052867Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:26.3053267Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:26.3053564Z #define _IO_LEFT 02 2025-05-07T20:26:26.3053836Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:26.3054105Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:26.3054389Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:26.3054816Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:26.3055253Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:26.3055809Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:26.3056175Z #define _IOS_OUTPUT 2 2025-05-07T20:26:26.3056628Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:26.3057215Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:26.3057653Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:26.3058166Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:26.3058567Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:26.3059604Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:26.3061014Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:26.3061376Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:26.3061687Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:26.3062096Z #define _T_WCHAR_ 2025-05-07T20:26:26.3062413Z #define stdout stdout 2025-05-07T20:26:26.3062823Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:26.3063370Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:26.3063738Z #define __flexarr [] 2025-05-07T20:26:26.3064128Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:26.3064585Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:26.3065034Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:26.3065398Z #define _MATH_H 1 2025-05-07T20:26:26.3065809Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:26.3066315Z #define __S64_TYPE long int 2025-05-07T20:26:26.3066675Z #define __stub_fchflags 2025-05-07T20:26:26.3067045Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:26.3067451Z #define __SQUAD_TYPE long int 2025-05-07T20:26:26.3067822Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:26.3068191Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:26.3068553Z #define NL_NMAX INT_MAX 2025-05-07T20:26:26.3068877Z #define _BITS_TIME_H 1 2025-05-07T20:26:26.3069261Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:26.3069718Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:26.3070531Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:26.3071049Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:26.3071609Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:26.3072180Z #define __CHAR_BIT__ 8 2025-05-07T20:26:26.3072446Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3072773Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:26.3073075Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:26.3073341Z #define FP_NAN 0 2025-05-07T20:26:26.3073614Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:26.3074076Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:26.3074580Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:26.3074977Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:26.3075274Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:26.3075538Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:26.3075801Z #define __SM_80_RT_H__ 2025-05-07T20:26:26.3076043Z #define _NEW 2025-05-07T20:26:26.3076279Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:26.3076562Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:26.3076939Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:26.3077369Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:26.3077612Z #define __USE_ANSI 1 2025-05-07T20:26:26.3077912Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:26.3078330Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:26.3078825Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:26.3079234Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:26.3079632Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:26.3079950Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:26.3080238Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:26.3080525Z #define PIPE_BUF 4096 2025-05-07T20:26:26.3080847Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:26.3081223Z #define ADJ_TICK 0x4000 2025-05-07T20:26:26.3081512Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:26.3081835Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:26.3082101Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:26.3082439Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:26.3082905Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.3083433Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:26.3083857Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:26.3084119Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:26.3084395Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3084691Z #define __cpp_static_assert 201411L 2025-05-07T20:26:26.3085034Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:26.3085384Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:26.3085665Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:26.3085952Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:26.3086264Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:26.3086546Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:26.3086854Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3087223Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:26.3087565Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:26.3087854Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:26.3088180Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3088543Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:26.3088904Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:26.3089206Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:26.3089509Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:26.3089843Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:26.3090177Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:26.3090732Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:26.3091148Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:26.3091465Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:26.3091744Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:26.3092113Z #define __GCC_IEC_559 2 2025-05-07T20:26:26.3092416Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:26.3092765Z #define _IO_flockfile(_fp) 2025-05-07T20:26:26.3093030Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:26.3093308Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:26.3093578Z #define _IOFBF 0 2025-05-07T20:26:26.3093797Z #define __USE_BSD 1 2025-05-07T20:26:26.3094077Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:26.3094369Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:26.3094650Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:26.3094906Z #define _IO_NO_WRITES 8 2025-05-07T20:26:26.3095171Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:26.3095534Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:26.3095891Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:26.3096207Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:26.3096534Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:26.3096833Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:26.3097110Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:26.3097406Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:26.3097825Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:26.3098328Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:26.3098705Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:26.3099022Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:26.3099332Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:26.3099669Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:26.3099984Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:26.3100295Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:26.3100573Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:26.3100855Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:26.3101453Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:26.3102050Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:26.3102387Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:26.3102720Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:26.3103023Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:26.3103307Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:26.3115291Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:26.3115792Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:26.3116171Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:26.3116469Z #define RAND_MAX 2147483647 2025-05-07T20:26:26.3116731Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:26.3117060Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3117375Z #define __SM_90_RT_H__ 2025-05-07T20:26:26.3117614Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:26.3117886Z #define __COMPAR_FN_T 2025-05-07T20:26:26.3118135Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3118427Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:26.3118978Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:26.3119587Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.3119966Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.3120362Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:26.3120683Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:26.3121061Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:26.3121415Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:26.3122012Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:26.3122872Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:26.3123458Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:26.3123744Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:26.3124093Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:26.3124500Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:26.3124773Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:26.3125038Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:26.3125309Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:26.3125560Z #define __u_char_defined 2025-05-07T20:26:26.3125873Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:26.3126237Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:26.3126498Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:26.3126748Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:26.3127035Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:26.3127477Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:26.3127903Z #define FP_INFINITE 1 2025-05-07T20:26:26.3128272Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.3128690Z #define _IO_pid_t __pid_t 2025-05-07T20:26:26.3128945Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:26.3129201Z #define __LEAF , __leaf__ 2025-05-07T20:26:26.3129456Z #define PATH_MAX 4096 2025-05-07T20:26:26.3129714Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:26.3130050Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:26.3130376Z #define _LIMITS_H___ 2025-05-07T20:26:26.3130605Z #define __size_t 2025-05-07T20:26:26.3130830Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:26.3131377Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:26.3131942Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:26.3132254Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:26.3132583Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:26.3132846Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:26.3133205Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:26.3133597Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:26.3133895Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:26.3134226Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:26.3134507Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:26.3134790Z #define __INT8_C(c) c 2025-05-07T20:26:26.3135050Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:26.3135344Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:26.3135609Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:26.3135866Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:26.3136113Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:26.3136391Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:26.3136723Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3137045Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:26.3137316Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:26.3137590Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:26.3137859Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:26.3138316Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:26.3138627Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:26.3138996Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:26.3139368Z #define NFDBITS __NFDBITS 2025-05-07T20:26:26.3139626Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:26.3139916Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:26.3140230Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:26.3140546Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:26.3140803Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:26.3141085Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:26.3141388Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:26.3141698Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:26.3142111Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:26.3142496Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:26.3143032Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:26.3143361Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:26.3143756Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:26.3144249Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:26.3144565Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:26.3144892Z #define __daddr_t_defined 2025-05-07T20:26:26.3145144Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.3145417Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:26.3145736Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:26.3146246Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:26.3146728Z #define _ACRTIMP 2025-05-07T20:26:26.3146958Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:26.3147220Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:26.3147508Z #define _IOS_BIN 128 2025-05-07T20:26:26.3147870Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:26.3148278Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:26:26.3148549Z #define UNDERFLOW 4 2025-05-07T20:26:26.3148775Z #define NAME_MAX 255 2025-05-07T20:26:26.3149005Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:26.3149278Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:26.3149560Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:26.3149857Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:26.3150237Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:26.3150627Z #define __ptr_t void * 2025-05-07T20:26:26.3150866Z #define M_E 2.7182818284590452354 2025-05-07T20:26:26.3151139Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:26.3151408Z #define __USE_ISOCXX11 1 2025-05-07T20:26:26.3151674Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:26.3151990Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:26.3152289Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:26.3152572Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:26.3152856Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:26.3153186Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:26.3153450Z #define __linux 1 2025-05-07T20:26:26.3153678Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:26.3153978Z #define cudaDeviceMask 0xff 2025-05-07T20:26:26.3154276Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:26.3154565Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:26.3154842Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:26.3155127Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:26.3155443Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:26.3156051Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:26.3156347Z #define _BITS_TYPES_H 1 2025-05-07T20:26:26.3156639Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:26.3156975Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:26.3157283Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:26.3157571Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:26.3157857Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:26.3158148Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:26.3158939Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:26.3159757Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:26.3160036Z #define __unix 1 2025-05-07T20:26:26.3160258Z #define MATH_ERRNO 1 2025-05-07T20:26:26.3160503Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:26.3160779Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:26.3161053Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:26.3161341Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:26.3161626Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3161910Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:26.3162629Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:26.3163096Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:26.3163399Z #define CUDARTAPI_CDECL 2025-05-07T20:26:26.3163790Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:26.3164069Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:26.3164351Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:26.3164616Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:26.3164854Z #define __SIZE_T 2025-05-07T20:26:26.3165101Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:26.3165424Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:26:26.3165723Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:26.3165983Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:26.3166251Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:26.3166647Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:26.3167078Z #define __WAIT_STATUS void * 2025-05-07T20:26:26.3167352Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:26.3167623Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:26.3167888Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:26.3168179Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:26.3168461Z #define __WINT_MIN__ 0U 2025-05-07T20:26:26.3169040Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:26.3169680Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:26.3169982Z #define WUNTRACED 2 2025-05-07T20:26:26.3170215Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:26.3170490Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:26.3170778Z #define NZERO 20 2025-05-07T20:26:26.3171010Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:26.3171281Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:26.3171574Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:26.3171868Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:26.3172121Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.3172405Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:26.3172677Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:26.3172955Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:26.3173226Z #define EXIT_FAILURE 1 2025-05-07T20:26:26.3173468Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:26.3173758Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:26.3174036Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:26.3174290Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:26.3174570Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:26.3174904Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:26.3175264Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:26.3175556Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:26.3175806Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:26.3176079Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:26.3176374Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:26.3176679Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:26.3176969Z #define SEEK_DATA 3 2025-05-07T20:26:26.3177221Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:26.3177511Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:26.3177934Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:26.3178452Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:26.3178706Z #define __INT64_C(c) c ## L 2025-05-07T20:26:26.3178976Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:26.3179311Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:26.3179637Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:26.3179913Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:26.3180211Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:26.3180515Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:26.3180772Z #define __INT_WCHAR_T_H 2025-05-07T20:26:26.3181012Z #define WSTOPPED 2 2025-05-07T20:26:26.3181300Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:26.3182659Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:26.3182959Z #define FP_NORMAL 4 
2025-05-07T20:26:26.3183209Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:26.3183490Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:26.3183839Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:26.3184127Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:26.3184421Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:26.3184693Z #define cudaTextureType1D 0x01 2025-05-07T20:26:26.3184967Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:26.3185233Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:26.3185499Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:26.3185797Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:26.3186228Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:26.3186673Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:26.3186945Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:26.3187218Z #define _POSIX_SOURCE 1 2025-05-07T20:26:26.3187465Z #define cudaTextureType2D 0x02 2025-05-07T20:26:26.3187738Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:26.3188012Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:26.3188323Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:26.3188601Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:26.3188929Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:26.3189272Z #define cudaTextureType3D 0x03 2025-05-07T20:26:26.3189539Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:26.3189800Z #define CLOCK_REALTIME 0 2025-05-07T20:26:26.3190050Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:26.3190320Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:26.3190626Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:26.3190916Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:26.3191190Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:26.3191482Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:26.3191755Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:26.3192108Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:26.3192497Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:26.3192781Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:26.3193027Z #define __GLIBC__ 2 2025-05-07T20:26:26.3193255Z #define __END_DECLS } 2025-05-07T20:26:26.3193497Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:26.3193862Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:26.3194236Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:26.3194490Z #define WCONTINUED 8 2025-05-07T20:26:26.3194725Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:26.3194981Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:26.3195261Z #define _ALLOCA_H 1 2025-05-07T20:26:26.3195496Z #define __host__ __location__(host) 2025-05-07T20:26:26.3195919Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.3196359Z #define __SLONG32_TYPE int 2025-05-07T20:26:26.3196636Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:26.3196926Z #define _SYS_SELECT_H 1 2025-05-07T20:26:26.3197174Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:26.3197422Z #define _IOS_NOCREATE 32 2025-05-07T20:26:26.3197666Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:26.3197955Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:26.3198249Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:26.3198538Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:26.3198816Z #define __global__ __location__(global) 2025-05-07T20:26:26.3199109Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:26.3199367Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:26:26.3199640Z #define __DBL_DIG__ 15 2025-05-07T20:26:26.3199869Z #define TIME_UTC 1 2025-05-07T20:26:26.3200087Z #define __FLT32_DIG__ 6 2025-05-07T20:26:26.3200406Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:26.3200808Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:26.3201127Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:26.3201433Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:26.3201850Z #define _G_BUFSIZ 8192 2025-05-07T20:26:26.3202159Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:26.3202530Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:26.3203006Z #define __cudaCDP2GetDevice 2025-05-07T20:26:26.3203373Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:26.3203690Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:26.3203958Z #define __GXX_WEAK__ 1 2025-05-07T20:26:26.3204214Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3204518Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:26.3204777Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:26.3205077Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:26.3205418Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:26.3205692Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:26.3205980Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:26.3206277Z #define _G_config_h 1 2025-05-07T20:26:26.3206560Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:26.3206899Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:26.3207183Z #define _GCC_WCHAR_T 2025-05-07T20:26:26.3207422Z #define TMP_MAX 238328 2025-05-07T20:26:26.3207664Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:26.3207939Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:26.3208212Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.3208489Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:26.3208773Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:26.3209060Z #define _IO_SKIPWS 01 2025-05-07T20:26:26.3209458Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:26.3209918Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:26.3210187Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:26.3210516Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:26.3210884Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:26.3211260Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:26.3211627Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:26.3211877Z #define le32toh(x) (x) 2025-05-07T20:26:26.3212118Z #define _SIZE_T_DEFINED 2025-05-07T20:26:26.3212378Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:26.3212712Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:26.3213063Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:26.3213461Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:26.3213896Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:26.3214190Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:26.3214455Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:26.3214714Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:26.3215000Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:26.3215533Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:26.3216036Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:26.3216351Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:26.3216704Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:26.3217021Z #define _WCHAR_T_ 2025-05-07T20:26:26.3217250Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:26.3217613Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:26.3217996Z #define RTSIG_MAX 32 2025-05-07T20:26:26.3218381Z #define _STDDEF_H 2025-05-07T20:26:26.3218618Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:26.3218890Z #define _VA_LIST_DEFINED 2025-05-07T20:26:26.3219138Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:26.3219478Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:26.3219870Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:26.3220200Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:26.3220486Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:26.3221122Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:26.3221651Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:26.3222014Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:26.3222422Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:26.3222737Z #define __unix__ 1 2025-05-07T20:26:26.3222969Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3223254Z #define __INT_WIDTH__ 32 2025-05-07T20:26:26.3223504Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:26.3223758Z #define _IONBF 2 2025-05-07T20:26:26.3224237Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:26.3225007Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:26.3225541Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:26.3225794Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:26.3226066Z #define __UINT16_C(c) c 2025-05-07T20:26:26.3226314Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:26.3226581Z #define STA_DEL 0x0020 2025-05-07T20:26:26.3226824Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:26.3227081Z #define __id_t_defined 2025-05-07T20:26:26.3227353Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:26.3227806Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:26.3228235Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:26.3228503Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:26.3228758Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:26.3229012Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:26.3229275Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:26.3229538Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:26.3229803Z #define SING 2 2025-05-07T20:26:26.3230022Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:26.3230289Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3230591Z #define cudaStreamDefault 0x00 2025-05-07T20:26:26.3230946Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:26.3231312Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:26.3231582Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:26.3231855Z #define __gnu_linux__ 1 2025-05-07T20:26:26.3232090Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:26.3232348Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:26.3232602Z #define MAX_INPUT 255 2025-05-07T20:26:26.3232839Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3233175Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:26.3233552Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:26.3233870Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:26.3234190Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:26.3234594Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:26.3235021Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:26.3235349Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:26.3235714Z #define _Mfloat_ float 2025-05-07T20:26:26.3235986Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:26.3236294Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:26.3236588Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:26.3237077Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:26.3237591Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.3245896Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:26.3246253Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:26.3246622Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:26.3246920Z #define __USE_ISOC11 1 2025-05-07T20:26:26.3247156Z #define _BSD_SIZE_T_ 2025-05-07T20:26:26.3247398Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:26.3247647Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:26.3247918Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:26.3248222Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:26.3248754Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:26.3249072Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:26.3249405Z #define __THROW throw () 2025-05-07T20:26:26.3249768Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:26.3250060Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3250420Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:26.3250778Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:26.3251056Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:26.3251314Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:26.3251583Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:26.3251841Z #define L_tmpnam 20 2025-05-07T20:26:26.3252071Z #define ___int_wchar_t_h 2025-05-07T20:26:26.3252421Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:26.3252802Z #define isascii(c) __isascii (c) 2025-05-07T20:26:26.3253071Z #define _T_PTRDIFF 2025-05-07T20:26:26.3253390Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:26.3253779Z #define toascii(c) __toascii (c) 2025-05-07T20:26:26.3254058Z #define __GNUC__ 11 2025-05-07T20:26:26.3254318Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:26.3254622Z #define __GXX_RTTI 1 2025-05-07T20:26:26.3254850Z #define __pie__ 2 2025-05-07T20:26:26.3255070Z #define __MMX__ 1 2025-05-07T20:26:26.3255291Z #define __cudaCDP2Malloc 2025-05-07T20:26:26.3255916Z #define __timespec_defined 1 2025-05-07T20:26:26.3256292Z #define L_ctermid 9 2025-05-07T20:26:26.3256618Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:26.3257028Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:26.3257543Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:26.3257927Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:26.3258262Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:26.3258563Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:26.3258878Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:26.3259199Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:26.3259471Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:26.3259921Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:26.3260688Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:26.3261290Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:26.3261605Z #define __USE_SVID 1 2025-05-07T20:26:26.3261866Z #define __constant__ __location__(constant) 2025-05-07T20:26:26.3262178Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:26.3262483Z #define __device__ __location__(device) 2025-05-07T20:26:26.3262815Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:26.3263141Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:26.3263411Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:26.3263732Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:26.3264110Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:26.3264486Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:26.3264772Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:26.3265153Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:26.3265540Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:26.3265797Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:26.3266178Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:26.3266603Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:26.3266929Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:26.3267208Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:26.3267473Z #define NGROUPS_MAX 65536 2025-05-07T20:26:26.3267735Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:26.3268003Z #define __USE_ISOC95 1 2025-05-07T20:26:26.3268226Z #define _TIME_H 1 2025-05-07T20:26:26.3268502Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:26.3269115Z #define __USE_ISOC99 1 2025-05-07T20:26:26.3269446Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:26.3269827Z #define HOST_NAME_MAX 64 2025-05-07T20:26:26.3270277Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:26.3270546Z #define _IOS_ATEND 4 2025-05-07T20:26:26.3270779Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:26.3271117Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.3271527Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.3271871Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:26.3272162Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:26.3272486Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:26.3272800Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:26.3273061Z #define _STDIO_H 1 2025-05-07T20:26:26.3273473Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:26.3274115Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:26.3274530Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:26.3274912Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:26.3275218Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:26.3275486Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:26.3275767Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:26.3276063Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:26.3276363Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.3276689Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:26.3276967Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.3277246Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:26.3277556Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:26.3277834Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:26.3278120Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:26.3278479Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:26.3278859Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:26.3279110Z #define __USE_XOPEN 1 2025-05-07T20:26:26.3279359Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:26.3279807Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:26.3280257Z #define __USE_XOPEN2K 1 2025-05-07T20:26:26.3280501Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:26.3280778Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:26.3281084Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:26.3281360Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:26.3281896Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.3282429Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.3282716Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:26.3283081Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:26.3283479Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:26.3283876Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:26.3284272Z #define __END_NAMESPACE_C99 2025-05-07T20:26:26.3284555Z #define __glibcxx_integral_traps true 2025-05-07T20:26:26.3284945Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:26.3285297Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:26.3285657Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:26.3286024Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:26.3286373Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:26.3286672Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:26.3286981Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:26.3287345Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:26.3287737Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:26.3288018Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:26.3288285Z #define _IO_UNITBUF 020000 2025-05-07T20:26:26.3288538Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:26.3288803Z #define __FD_SETSIZE 1024 2025-05-07T20:26:26.3289198Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:26.3289474Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:26.3289823Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:26.3290185Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:26.3290539Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:26.3290855Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:26.3291182Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:26.3291457Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:26.3291766Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:26.3292107Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:26.3292394Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:26.3292726Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:26.3293027Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:26.3293304Z #define __USE_POSIX199506 1 2025-05-07T20:26:26.3293557Z #define _FEATURES_H 1 2025-05-07T20:26:26.3293831Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:26.3294257Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:26.3294673Z #define __stub_getmsg 2025-05-07T20:26:26.3294911Z #define _IO_FIXED 010000 2025-05-07T20:26:26.3295194Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:26.3295508Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:26.3295787Z #define __stub_setlogin 2025-05-07T20:26:26.3296030Z #define __stub_fattach 2025-05-07T20:26:26.3296273Z #define __cplusplus 201703L 2025-05-07T20:26:26.3296548Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:26.3296834Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:26.3297091Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:26.3297376Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:26.3297875Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:26.3298578Z #define _IO_INTERNAL 010 2025-05-07T20:26:26.3298827Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:26.3299176Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:26.3299540Z #define __dev_t_defined 2025-05-07T20:26:26.3299781Z #define __DEPRECATED 1 2025-05-07T20:26:26.3300018Z #define __S32_TYPE int 2025-05-07T20:26:26.3300280Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:26.3300576Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:26.3300840Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:26.3301100Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:26.3301709Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:26.3302352Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:26.3302667Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:26.3303013Z #define OVERFLOW 3 2025-05-07T20:26:26.3303260Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:26.3303573Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:26.3303876Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3304256Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:26.3304593Z #define __SSE2_MATH__ 1 2025-05-07T20:26:26.3304845Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:26.3305154Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3305465Z #define _IO_STDIO_H 2025-05-07T20:26:26.3305719Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:26.3306013Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:26.3306343Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:26.3306647Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:26.3306963Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:26.3307230Z #define __amd64 1 2025-05-07T20:26:26.3307457Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:26.3307725Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:26.3308025Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:26.3308322Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:26.3308639Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:26.3308908Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:26.3309434Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:26.3309715Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:26.3309963Z #define __bounded 2025-05-07T20:26:26.3310207Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3310584Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:26.3310869Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:26.3311143Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:26.3311425Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3311745Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:26.3312169Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:26.3312577Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:26.3312854Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:26.3313196Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:26.3313559Z #define STA_PLL 0x0001 2025-05-07T20:26:26.3313852Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:26.3314128Z #define __GNUG__ 11 2025-05-07T20:26:26.3314369Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:26.3314635Z #define _T_WCHAR 2025-05-07T20:26:26.3314872Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:26.3315170Z #define __specialization_static 2025-05-07T20:26:26.3315477Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:26.3315790Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:26.3316058Z #define cudaArraySparse 0x40 2025-05-07T20:26:26.3316328Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:26.3316577Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:26.3316867Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:26.3317172Z #define _WCHAR_T 2025-05-07T20:26:26.3317391Z #define __cudaCDP2Free 2025-05-07T20:26:26.3318043Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:26.3318757Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:26.3319183Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:26.3319625Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:26.3319914Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:26.3320180Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:26.3320513Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.3320867Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:26.3321118Z #define __NO_CTYPE 1 2025-05-07T20:26:26.3321354Z #define __stub_bdflush 2025-05-07T20:26:26.3321727Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:26.3322157Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:26.3322468Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:26.3322738Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:26.3323022Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:26.3323336Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:26.3323654Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:26.3324035Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:26.3324389Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:26.3324676Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:26.3324964Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:26.3325312Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:26.3325660Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:26.3325943Z #define _IO_STDIO 040000 2025-05-07T20:26:26.3326282Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:26.3326677Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:26.3326997Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:26.3327297Z #define _PTRDIFF_T 2025-05-07T20:26:26.3327521Z #define _MOVE_H 1 2025-05-07T20:26:26.3327747Z #define __cpp_hex_float 201603L 2025-05-07T20:26:26.3328019Z #define ADJ_TAI 0x0080 2025-05-07T20:26:26.3328252Z #define __ptrvalue 2025-05-07T20:26:26.3328581Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:26.3328839Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:26.3329129Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:26.3329512Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:26.3329773Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:26.3330060Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:26.3330462Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:26.3330838Z #define __USE_GNU 1 2025-05-07T20:26:26.3331075Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:26.3331356Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:26.3331622Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:26.3332014Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:26.3332410Z #define WEXITED 4 2025-05-07T20:26:26.3332625Z #define _IO_NO_READS 4 2025-05-07T20:26:26.3332929Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:26.3333289Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:26.3333569Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:26.3333904Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:26.3334255Z #define __uid_t_defined 2025-05-07T20:26:26.3334542Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:26.3334938Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:26.3335275Z #define WNOHANG 1 2025-05-07T20:26:26.3335527Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:26.3335836Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:26.3336120Z #define cudaEventDefault 0x00 2025-05-07T20:26:26.3336434Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:26.3336752Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:26.3336993Z #define __x86_64 1 2025-05-07T20:26:26.3337231Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:26.3337625Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:26.3338220Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:26.3338738Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.3339182Z #define __PTRDIFF_T 2025-05-07T20:26:26.3339514Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:26.3339899Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:26.3340180Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3340474Z #define _Mlong_double_ long double 2025-05-07T20:26:26.3340764Z #define __cpp_lambdas 200907L 2025-05-07T20:26:26.3341027Z #define _IO_DEC 020 2025-05-07T20:26:26.3341256Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:26.3341536Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:26.3341831Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:26.3342114Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:26.3342381Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:26.3342687Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:26.3343025Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:26.3343312Z #define _ANSI_STDDEF_H 2025-05-07T20:26:26.3343601Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:26.3343969Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:26.3344349Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:26.3344737Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:26.3345027Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:26.3345320Z #define __cpp_template_auto 201606L 2025-05-07T20:26:26.3345683Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:26.3346056Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:26.3346323Z #define 
__key_t_defined 2025-05-07T20:26:26.3346577Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:26.3346954Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:26.3347425Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:26.3347917Z #define __GNUC_VA_LIST 2025-05-07T20:26:26.3348260Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:26.3348649Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:26.3349005Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:26.3349291Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:26.3349598Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:26.3349852Z #define __WCOREFLAG 0x80 2025-05-07T20:26:26.3350109Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:26.3350419Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:26.3350697Z #define __LP64__ 1 2025-05-07T20:26:26.3350947Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:26.3351270Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:26.3351553Z #define _IO_off64_t __off64_t 2025-05-07T20:26:26.3351822Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:26.3352091Z #define __time_t_defined 1 2025-05-07T20:26:26.3352354Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:26.3352707Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:26.3353078Z #define __USE_UNIX98 1 2025-05-07T20:26:26.3353326Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3353609Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:26.3353933Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:26.3354237Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:26.3354552Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:26.3354817Z #define SEEK_CUR 1 2025-05-07T20:26:26.3355054Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:26.3355332Z #define _ASSERT_H 1 2025-05-07T20:26:26.3356433Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:26.3357086Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:26.3357372Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:26.3357630Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:26.3357904Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:26.3358188Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:26.3358568Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:26.3358986Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:26.3359659Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:26.3360324Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:26.3360627Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:26.3360988Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:26.3361372Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:26.3361643Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:26.3361932Z #define cudaArrayDefault 0x00 2025-05-07T20:26:26.3362221Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:26.3362517Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:26.3362806Z #define TLOSS 5 2025-05-07T20:26:26.3363031Z #define __ssize_t_defined 2025-05-07T20:26:26.3363282Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:26.3363563Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:26.3363869Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:26.3364163Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:26.3364536Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:26.3364929Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:26.3365223Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:26.3365512Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:26.3365831Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:26.3366138Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:26.3366425Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:26.3366687Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:26.3367026Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:26.3367386Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:26.3367884Z #define __cdecl 2025-05-07T20:26:26.3368132Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:26.3368461Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:26.3368796Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:26.3369235Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:26.3369512Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:26.3369802Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:26.3370092Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:26.3370408Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:26.3370740Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:26.3378634Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:26.3379097Z #define ADJ_NANO 0x2000 2025-05-07T20:26:26.3379419Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:26.3379778Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:26.3380080Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:26.3380361Z #define __FLT_DIG__ 6 2025-05-07T20:26:26.3380721Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:26.3381130Z #define __NO_INLINE__ 1 2025-05-07T20:26:26.3381451Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:26.3381809Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:26.3382080Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:26.3382355Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:26.3382653Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:26.3382935Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:26.3383241Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:26.3383547Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:26.3383992Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:26.3384419Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:26.3384769Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:26.3385127Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:26.3385382Z #define MAX_CANON 255 2025-05-07T20:26:26.3385620Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:26.3385891Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:26.3386173Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:26.3386469Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:26.3386787Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:26.3387102Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:26.3387388Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:26.3387722Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:26.3388096Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:26.3388462Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:26.3388766Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:26.3389158Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:26.3389451Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:26.3389768Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:26.3390080Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:26.3390350Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:26.3390605Z #define _SYS_TYPES_H 1 2025-05-07T20:26:26.3390858Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:26.3391134Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:26.3391388Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:26.3391636Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:26.3391920Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:26.3392218Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:26.3392487Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:26.3392795Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:26.3393067Z #define FP_SUBNORMAL 3 2025-05-07T20:26:26.3393325Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:26.3393614Z #define _INITIALIZER_LIST 2025-05-07T20:26:26.3393897Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:26.3394175Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:26.3394461Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:26.3394758Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:26.3395182Z #define _IO_file_flags _flags 2025-05-07T20:26:26.3395454Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:26.3395712Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:26.3396078Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:26.3396362Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:26.3396638Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:26.3397015Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:26.3397422Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:26.3397737Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:26.3398010Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:26.3398271Z #define _BSD_SOURCE 1 2025-05-07T20:26:26.3398516Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:26.3399382Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:26.3400253Z #define __catch(X) catch(X) 2025-05-07T20:26:26.3400527Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:26.3400825Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:26.3401107Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:26.3401366Z #define __STRING(x) #x 2025-05-07T20:26:26.3401615Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:26.3401887Z #define _T_PTRDIFF_ 2025-05-07T20:26:26.3402140Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:26.3402451Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:26.3402729Z #define __unbounded 2025-05-07T20:26:26.3402977Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:26.3403272Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:26.3403558Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3403859Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:26.3404141Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:26.3404445Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:26.3404781Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:26.3405095Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:26.3405383Z #define __managed__ __location__(managed) 2025-05-07T20:26:26.3405686Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:26.3406094Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:26.3406521Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:26.3406778Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:26.3407159Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:26.3407567Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:26.3407828Z #define _SYS_SIZE_T_H 2025-05-07T20:26:26.3408119Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:26.3408464Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:26.3408755Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:26.3409048Z #define _CRTIMP 2025-05-07T20:26:26.3409282Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:26.3409599Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:26.3409927Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:26.3410292Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:26.3410715Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:26.3411034Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:26.3411327Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:26.3411624Z #define __SIZE_T__ 2025-05-07T20:26:26.3411845Z #define __stub_gtty 2025-05-07T20:26:26.3412076Z #define __pid_t_defined 2025-05-07T20:26:26.3412343Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:26.3412650Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:26.3412969Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:26.3413270Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:26.3413526Z #define __need_clockid_t 2025-05-07T20:26:26.3413799Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:26.3414089Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:26.3414521Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:26.3414842Z #define _IO_HEX 0100 2025-05-07T20:26:26.3415112Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:26.3415551Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:26.3415861Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:26.3416147Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:26.3416561Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:26.3417011Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:26.3417327Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:26.3417631Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:26.3417740Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:26.3417852Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:26.3417937Z #define __stub_sstk 2025-05-07T20:26:26.3418033Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:26.3418306Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:26.3418388Z #define __wur 2025-05-07T20:26:26.3418509Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:26.3418603Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:26.3418693Z #define _IO_OCT 040 2025-05-07T20:26:26.3418796Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:26.3418888Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:26.3418980Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:26.3419117Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:26.3419211Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:26.3419315Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:26.3419514Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:26.3419611Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:26.3419703Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:26.3419819Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:26.3419910Z #define __off64_t_defined 2025-05-07T20:26:26.3420011Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:26.3420114Z #define __FLT128_DIG__ 33 2025-05-07T20:26:26.3420222Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:26.3420325Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:26.3420410Z #define __INT32_C(c) c 2025-05-07T20:26:26.3420512Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:26.3420616Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:26.3420713Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:26.3420805Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:26.3420899Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:26.3420996Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:26.3421128Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:26.3421236Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:26.3421327Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:26.3421430Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:26.3421532Z #define __have_pthread_attr_t 1 2025-05-07T20:26:26.3421635Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:26.3421872Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:26.3421982Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:26.3422086Z #define __cudaCDP2EventRecord 2025-05-07T20:26:26.3422192Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:26.3422278Z #define 
htole32(x) (x) 2025-05-07T20:26:26.3422535Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:26.3422667Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:26.3422767Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:26.3422927Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:26.3423074Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:26.3423203Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:26.3423354Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:26.3423451Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:26.3423554Z #define cudaArrayLayered 0x01 2025-05-07T20:26:26.3423829Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:26.3423965Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:26.3424071Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:26.3424198Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:26.3424357Z #define unix 1 2025-05-07T20:26:26.3424452Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:26.3424556Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:26.3424653Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:26.3424774Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:26.3424871Z #define __USE_POSIX 1 2025-05-07T20:26:26.3424968Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:26.3425111Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:26.3425205Z #define __THROWNL throw () 2025-05-07T20:26:26.3425298Z #define __cpp_rtti 199711L 2025-05-07T20:26:26.3425412Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:26.3425502Z #define __PMT(args) args 2025-05-07T20:26:26.3425617Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:26.3425782Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:26.3425900Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:26.3425992Z #define _SIZE_T_DECLARED 2025-05-07T20:26:26.3426102Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:26.3426197Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:26.3426603Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:26.3426706Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:26.3426802Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:26.3426905Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:26.3427051Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:26.3427135Z #define _WCHAR_T_H 2025-05-07T20:26:26.3427235Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:26.3427327Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:26.3427416Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:26.3427523Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:26.3427625Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:26.3427715Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:26.3427830Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:26.3427916Z #define __ELF__ 1 2025-05-07T20:26:26.3428023Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:26.3428123Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:26.3428210Z #define STA_INS 0x0010 2025-05-07T20:26:26.3428315Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:26.3428488Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:26.3428583Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:26.3428686Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:26.3428798Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:26.3428909Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3429014Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:26.3429120Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:26.3429223Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:26.3429384Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:26.3429543Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:26.3429653Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:26.3429981Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:26.3430111Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:26.3430215Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:26.3430304Z #define __FLT_RADIX__ 2 2025-05-07T20:26:26.3430410Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:26.3430584Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:26.3430681Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:26.3430783Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:26.3430889Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:26.3430988Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:26.3431096Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:26.3431293Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:26.3431380Z #define WORD_BIT 32 2025-05-07T20:26:26.3431475Z #define _IO_USER_BUF 1 2025-05-07T20:26:26.3431570Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:26.3431792Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:26.3431910Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:26.3432013Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:26.3432113Z #define __long_double_t long double 2025-05-07T20:26:26.3432215Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:26.3432309Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:26.3432726Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:26.3432833Z #define __k8 1 2025-05-07T20:26:26.3433034Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:26.3433212Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:26.3433339Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:26.3433441Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:26.3433551Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:26.3433655Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:26.3433759Z #define __blksize_t_defined 2025-05-07T20:26:26.3433859Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:26.3433961Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:26.3434080Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:26.3434182Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:26.3434291Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:26.3434394Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:26.3434492Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:26.3434748Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:26.3435099Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:26.3435204Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:26.3435307Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:26.3435398Z #define SEEK_SET 0 2025-05-07T20:26:26.3435498Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:26.3435598Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:26.3435803Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:26.3435909Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:26.3436020Z #define __cudaCDP2GetLastError 2025-05-07T20:26:26.3436116Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:26.3436209Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:26.3436539Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:26.3436639Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:26.3436739Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:26.3436839Z #define __stub_sigreturn 2025-05-07T20:26:26.3437085Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:26.3437184Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:26.3437284Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:26.3437385Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:26.3437482Z #define CLOCK_TAI 11 2025-05-07T20:26:26.3437592Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:26.3437682Z #define __restrict_arr 2025-05-07T20:26:26.3437801Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:26.3437945Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:26.3438475Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:26.3438666Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:26.3438751Z #define __USE_MISC 1 2025-05-07T20:26:26.3438863Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:26.3439052Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:26.3439143Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:26.3439238Z #define __LDBL_DIG__ 18 2025-05-07T20:26:26.3439336Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:26.3439517Z #define __malloc_and_calloc_defined 2025-05-07T20:26:26.3439618Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:26.3439723Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:26.3439806Z #define __x86_64__ 1 2025-05-07T20:26:26.3439898Z #define _SIZE_T_ 2025-05-07T20:26:26.3440768Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:26.3440880Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:26.3440979Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:26.3441101Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:26.3441226Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:26.3441324Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:26.3441440Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:26.3441572Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:26.3441715Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:26.3441818Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:26.3442286Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:26.3442417Z [... preprocessor macro dump elided: several thousand #define lines emitted by the host compiler / nvcc covering glibc, libstdc++, and CUDA runtime headers; notable values include __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, CUDART_VERSION 12060, _GLIBCXX_RELEASE 11, __GNUC_MINOR__ 4, __GLIBC_MINOR__ 17 ...]
2025-05-07T20:26:26.3733556Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:28.2687066Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:28.2687746Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:28.2688473Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:28.2688861Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:28.2689333Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:28.3326235Z /usr/bin/nvidia-smi
2025-05-07T20:26:28.3330321Z + nvidia-smi
2025-05-07T20:26:28.3504656Z Wed May 7 20:26:28 2025
2025-05-07T20:26:28.3505295Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3506140Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:28.3507069Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.3507963Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:28.3508814Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:28.3509639Z |                                         |                        |               MIG M. |
2025-05-07T20:26:28.3510228Z |=========================================+========================+======================|
2025-05-07T20:26:28.3672465Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:28.3673290Z |  0%   27C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:26:28.3673912Z |                                         |                        |                  N/A |
2025-05-07T20:26:28.3674569Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:28.3678058Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.3678783Z | Processes:                                                                              |
2025-05-07T20:26:28.3679685Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:28.3680393Z |        ID   ID                                                               Usage      |
2025-05-07T20:26:28.3680911Z |=========================================================================================|
2025-05-07T20:26:28.3683926Z |  No running processes found                                                             |
2025-05-07T20:26:28.3684787Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:28.6071004Z [INSTALL] Successfully installed CUDA 12.6.3
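[NOTE] Taken together, the two checks above confirm the toolchain is consistent: the conda-installed toolkit reports release 12.6 (V12.6.85), while nvidia-smi reports driver 570.133.07 whose maximum supported CUDA version is 12.8, which is >= the 12.6 toolkit, so binaries built against CUDA 12.6 can run on this driver. A minimal stand-alone sketch of the same check (assumes nvcc and nvidia-smi are on PATH):

    nvcc --version | grep -o 'release [0-9.]*'                       # toolkit version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader      # driver version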
2025-05-07T20:26:28.6124953Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.6125587Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:28.6138927Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:28.6139527Z env:
2025-05-07T20:26:28.6139873Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:28.6140237Z BUILD_ENV: build_binary
2025-05-07T20:26:28.6140663Z BUILD_TARGET: genai
2025-05-07T20:26:28.6141016Z BUILD_VARIANT: cuda
2025-05-07T20:26:28.6141353Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:28.6141730Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:28.6142155Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:28.6142611Z ##[endgroup]
2025-05-07T20:26:28.9502167Z ################################################################################
2025-05-07T20:26:28.9502611Z # Install PyTorch (PIP)
2025-05-07T20:26:28.9503037Z #
2025-05-07T20:26:28.9516914Z # [2025-05-07T20:26:28.951Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.9517410Z ################################################################################
2025-05-07T20:26:28.9545484Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.9614352Z Channels:
2025-05-07T20:26:29.9614741Z  - conda-forge
2025-05-07T20:26:29.9615483Z Platform: linux-64
2025-05-07T20:26:33.2497022Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.9653459Z Solving environment: done
2025-05-07T20:26:34.1816383Z ## Package Plan ##
2025-05-07T20:26:34.1817197Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:34.1817815Z added / updated specs:
2025-05-07T20:26:34.1818277Z  - numpy
2025-05-07T20:26:34.1818697Z The following packages will be downloaded:
2025-05-07T20:26:34.1819120Z     package              |                build |   size | channel
2025-05-07T20:26:34.1819532Z     ---------------------|----------------------|--------|------------
2025-05-07T20:26:34.1820036Z     libblas-3.9.0        | 31_h59b9bed_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1820577Z     libcblas-3.9.0       | 31_he106b2a_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1821132Z     libgfortran-15.1.0   | h69a702a_2           |  34 KB | conda-forge
2025-05-07T20:26:34.1821732Z     libgfortran5-15.1.0  | hcea5267_2           | 1.5 MB | conda-forge
2025-05-07T20:26:34.1822249Z     liblapack-3.9.0      | 31_h7ac8fdf_openblas |  16 KB | conda-forge
2025-05-07T20:26:34.1822819Z     libopenblas-0.3.29   | pthreads_h94d23a6_0  | 5.6 MB | conda-forge
2025-05-07T20:26:34.1837826Z     numpy-2.2.5          | py310hefbff90_0      | 7.6 MB | conda-forge
2025-05-07T20:26:34.1838275Z     ------------------------------------------------------------
2025-05-07T20:26:34.1838660Z                                             Total: 14.8 MB
2025-05-07T20:26:34.1839063Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:34.1839599Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:34.1840170Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:34.1840685Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:34.1841211Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:34.1841970Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:34.1842740Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:34.1843923Z   numpy         conda-forge/linux-64::numpy-2.2.5-py310hefbff90_0
2025-05-07T20:26:34.1844659Z Downloading and Extracting Packages: ...working... done [per-package progress-bar redraws and terminal control characters elided; all seven packages reached 100%]
2025-05-07T20:26:35.2345541Z Preparing transaction: done
2025-05-07T20:26:35.4351588Z Verifying transaction: done
2025-05-07T20:26:35.5359506Z Executing transaction: done
2025-05-07T20:26:35.7126535Z ################################################################################
2025-05-07T20:26:35.7126955Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:35.7127257Z #
2025-05-07T20:26:35.7142071Z # [2025-05-07T20:26:35.713Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:35.7142560Z ################################################################################
2025-05-07T20:26:35.7157658Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.8104609Z [CHECK] Network does not appear to be blocked.
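[NOTE] The [EXEC] [ATTEMPT i/3] prefix indicates that each external command is run through a retry wrapper defined in setup_env.bash. A minimal bash sketch of that pattern (the function name and backoff here are illustrative, not the script's actual implementation):

    exec_with_retries () {
      local max_attempts=3
      local attempt
      for ((attempt = 0; attempt < max_attempts; attempt++)); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        "$@" && return 0   # stop on first success
        sleep 2            # brief pause before retrying
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*"
      return 1
    }

Under this pattern, the numpy step above would be invoked as: exec_with_retries conda install -n build_binary -c conda-forge --override-channels -y numpy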
2025-05-07T20:26:35.8105570Z ################################################################################
2025-05-07T20:26:35.8106303Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.8106707Z #
2025-05-07T20:26:35.8121755Z # [2025-05-07T20:26:35.811Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:35.8122533Z ################################################################################
2025-05-07T20:26:35.8143888Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.8170804Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:35.8187931Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.8188708Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:35.8197413Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:26:35.8207092Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:35.8229045Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6194257Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6194894Z Collecting torch
2025-05-07T20:27:55.6195575Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.6196299Z [... dependency resolution elided: pip also collected and downloaded filelock-3.16.1, sympy-1.13.3, networkx-3.4.2, jinja2-3.1.4, fsspec-2024.10.0, MarkupSafe-2.1.5, mpmath-1.3.0, pytorch-triton-3.3.0+git96316ce5, and the nvidia-*-cu12 wheels enumerated in the install summary below (the largest being nvidia-cudnn-cu12 at 571.0 MB and nvidia-cublas-cu12 at 393.1 MB); typing-extensions>=4.10.0 and setuptools>=40.8.0 were already satisfied ...]
2025-05-07T20:27:55.6237751Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp310-cp310-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.6245609Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.6249395Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:27:57.8293526Z torch 2.8.0.dev20250507+cu126
2025-05-07T20:27:57.8296088Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:01.2755338Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:04.7150224Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:04.7150786Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:08.0891158Z True
2025-05-07T20:28:08.0891502Z True
2025-05-07T20:28:08.1523087Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
2025-05-07T20:28:08.1561889Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:08.1562514Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:28:08.1575507Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:28:08.1575860Z env:
2025-05-07T20:28:08.1576091Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:28:08.1576394Z BUILD_ENV: build_binary
2025-05-07T20:28:08.1576648Z BUILD_TARGET: genai
2025-05-07T20:28:08.1576894Z BUILD_VARIANT: cuda
2025-05-07T20:28:08.1577135Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:28:08.1577391Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:28:08.1577824Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:28:08.1578257Z ##[endgroup]
2025-05-07T20:28:08.4945569Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:28:08.4947422Z ################################################################################
2025-05-07T20:28:08.4947903Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:28:08.4948283Z #
2025-05-07T20:28:08.4964277Z # [2025-05-07T20:28:08.496Z] + collect_pytorch_env_info build_binary
2025-05-07T20:28:08.4964681Z ################################################################################
2025-05-07T20:28:08.4981617Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:28:08.5916898Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:28:08.5927429Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:28:08.5928066Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:28:08.6819295Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:28:08.6841656Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:28:14.6380002Z Collecting environment information...
2025-05-07T20:28:14.6380383Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.6380718Z Is debug build: False 2025-05-07T20:28:14.6380977Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.6381270Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.6381478Z 2025-05-07T20:28:14.6381602Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.6381925Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.6382251Z Clang version: Could not collect 2025-05-07T20:28:14.6382547Z CMake version: Could not collect 2025-05-07T20:28:14.6382815Z Libc version: glibc-2.34 2025-05-07T20:28:14.6382980Z 2025-05-07T20:28:14.6383305Z Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.6383938Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.6384369Z Is CUDA available: True 2025-05-07T20:28:14.6384623Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.6384900Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.6385215Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.6385547Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.6395427Z cuDNN version: Could not collect 2025-05-07T20:28:14.6395768Z HIP runtime version: N/A 2025-05-07T20:28:14.6396023Z MIOpen runtime version: N/A 2025-05-07T20:28:14.6396293Z Is XNNPACK available: True 2025-05-07T20:28:14.6396460Z 2025-05-07T20:28:14.6396550Z CPU: 2025-05-07T20:28:14.6396767Z Architecture: x86_64 2025-05-07T20:28:14.6397113Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.6397511Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.6397895Z Byte Order: Little Endian 2025-05-07T20:28:14.6398215Z CPU(s): 16 2025-05-07T20:28:14.6398515Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.6399148Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.6399494Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.6399813Z CPU family: 23 2025-05-07T20:28:14.6400100Z Model: 49 2025-05-07T20:28:14.6400382Z Thread(s) per core: 2 2025-05-07T20:28:14.6400677Z Core(s) per socket: 8 2025-05-07T20:28:14.6400961Z Socket(s): 1 2025-05-07T20:28:14.6401296Z Stepping: 0 2025-05-07T20:28:14.6401623Z BogoMIPS: 5599.99 2025-05-07T20:28:14.6403744Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.6405884Z Hypervisor vendor: KVM 2025-05-07T20:28:14.6406197Z Virtualization type: full 2025-05-07T20:28:14.6406540Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.6406893Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.6407405Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.6407760Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.6408075Z NUMA node(s): 1 2025-05-07T20:28:14.6408373Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.6408710Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.6409092Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.6409448Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.6409800Z 
Vulnerability Mds: Not affected 2025-05-07T20:28:14.6410154Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.6410505Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.6410875Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.6411426Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.6412011Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.6412559Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.6413254Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.6414125Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.6414808Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.6415166Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.6415404Z 2025-05-07T20:28:14.6415509Z Versions of relevant libraries: 2025-05-07T20:28:14.6415779Z [pip3] numpy==2.2.5 2025-05-07T20:28:14.6416023Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.6416330Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.6416652Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.6416962Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.6417279Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.6417574Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.6417859Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.6418259Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.6418576Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.6418998Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.6419305Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.6419594Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.6419889Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.6420177Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.6420479Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.6420856Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6421340Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6421855Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6422381Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6422910Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6423442Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.6423921Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6424393Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.6424879Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.6425372Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.6425842Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6426400Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.6426859Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6427306Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6427781Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6428263Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.6428722Z [conda] libcublas 12.6.4.1 h5888daf_1 
conda-forge 2025-05-07T20:28:14.6429181Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.6429646Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6430102Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.6430564Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6431030Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.6431502Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.6431985Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.6432465Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6432946Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.6433430Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.6433913Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.6434370Z [conda] numpy 2.2.5 py310hefbff90_0 conda-forge 2025-05-07T20:28:14.6434831Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.6435336Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.6435829Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6436331Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6436829Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.6437392Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.6437861Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.6438348Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.6438836Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.6439325Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.6439812Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.6440296Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.6440777Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.6441249Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.6441765Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.6442035Z 2025-05-07T20:28:14.7086723Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.7087431Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.7099504Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.7099861Z env: 2025-05-07T20:28:14.7100094Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.7100426Z BUILD_ENV: build_binary 2025-05-07T20:28:14.7100681Z BUILD_TARGET: genai 2025-05-07T20:28:14.7101111Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.7101353Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.7101622Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.7101933Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.7102273Z ##[endgroup] 2025-05-07T20:28:15.0498127Z ################################################################################ 2025-05-07T20:28:15.0498576Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:15.0498833Z # 2025-05-07T20:28:15.0514627Z # [2025-05-07T20:28:15.051Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:15.0515053Z ################################################################################ 2025-05-07T20:28:15.0515274Z 2025-05-07T20:28:15.0531270Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:15.1464528Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:15.1488765Z [BUILD] Running git submodules update ... 2025-05-07T20:28:15.1512958Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:15.1876056Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:15.1876762Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:15.1877409Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:15.1877817Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:15.1878233Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:15.1878689Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:15.1879105Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:15.1909597Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:15.2461242Z [BUILD] Installing other build dependencies ... 
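Note: the [EXEC] [ATTEMPT 0/3] prefix used throughout this log indicates commands are run through a retry helper in setup_env.bash. A minimal sketch of such a wrapper follows; the function name, retry count, and backoff here are illustrative assumptions, not the actual implementation:

    exec_with_retries () {
      # Retry a command up to 3 times, echoing an attempt counter in the
      # same "[EXEC] [ATTEMPT n/3]" style seen in this log.
      local max_retries=3 attempt
      for (( attempt = 0; attempt <= max_retries; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    }

    # e.g. the dependency install that follows in the log:
    exec_with_retries conda run --no-capture-output -n build_binary \
      python -m pip install -r requirements.txt
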
2025-05-07T20:28:15.2482147Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.6526029Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.6701096Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:17.7717798Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:17.7744727Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:18.0236519Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:18.0267760Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:18.1285819Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:18.1319527Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.4768683Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.4802826Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.5362168Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.5365781Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.6152263Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.6179342Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.6628993Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:18.7234913Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:18.7262238Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:18.8414110Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:18.8442913Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:18.9417091Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:18.9453018Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:18.9985527Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:19.0607684Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:19.0640356Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:19.1587707Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:19.1612419Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.2638389Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.2667132Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.3749253Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.3777610Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.4720155Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:19.4749415Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:19.5821470Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.5850629Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.6854143Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.6887400Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.8180804Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.8208340Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:19.9172162Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:19.9319385Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:19.9809311Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.0359333Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.0385756Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.0867595Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.1408503Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.1438724Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.1904098Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.2562751Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.2589256Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:20.3096389Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:20.3571109Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:20.4058991Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:20.9045579Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.8 MB/s eta 0:00:00 2025-05-07T20:28:20.9075858Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:20.9554788Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.0111825Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.0539850Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.1156569Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.1612860Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 2025-05-07T20:28:21.2233452Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 kB 8.2 MB/s eta 0:00:00 2025-05-07T20:28:21.2283167Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:21.2738601Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.3230748Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:21.3748161Z 
Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:21.4265349Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:21.4720813Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:21.5196188Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:21.5683980Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:21.6230053Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:21.6750006Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:21.7273473Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:21.7757719Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.0048477Z Installing collected packages: sortedcontainers, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:24.5254665Z 2025-05-07T20:28:24.5325506Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 2025-05-07T20:28:24.7103747Z ################################################################################ 2025-05-07T20:28:24.7104494Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:24.7104763Z # 2025-05-07T20:28:24.7119881Z # [2025-05-07T20:28:24.711Z] + install_triton_pip build_binary 2025-05-07T20:28:24.7120271Z ################################################################################ 2025-05-07T20:28:24.7120487Z 2025-05-07T20:28:24.7120715Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:24.7121153Z ################################################################################ 2025-05-07T20:28:24.7121519Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:24.7121851Z # 2025-05-07T20:28:24.7136511Z # [2025-05-07T20:28:24.713Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:24.7137042Z ################################################################################ 2025-05-07T20:28:24.7137261Z 2025-05-07T20:28:24.7151985Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:24.8070483Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:24.8071070Z ################################################################################ 2025-05-07T20:28:24.8071425Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:24.8071717Z # 2025-05-07T20:28:24.8088952Z # [2025-05-07T20:28:24.808Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:24.8089445Z ################################################################################ 2025-05-07T20:28:24.8089666Z 2025-05-07T20:28:24.8136147Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:24.8152799Z [INSTALL] Using a non-RELEASE channel: nightly ... 
2025-05-07T20:28:24.8153324Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:24.8161918Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:24.8171130Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:24.8192190Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.4891601Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:32.4893099Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:32.4893853Z 2025-05-07T20:28:32.4894071Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.4894483Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.4895286Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:32.4896515Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:32.4897613Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 55.8 MB/s eta 0:00:00 2025-05-07T20:28:32.4897991Z Installing collected packages: pytorch-triton 2025-05-07T20:28:32.4898428Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:32.4898824Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:32.4899250Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:32.4899672Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:32.4900113Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:32.4900372Z 2025-05-07T20:28:34.6804421Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:34.6809224Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:36.8235360Z ################################################################################ 2025-05-07T20:28:36.8235860Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:36.8236254Z ################################################################################ 2025-05-07T20:28:36.8236475Z 2025-05-07T20:28:38.8643149Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:40.9758048Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:40.9762052Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:40.9819896Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:40.9820388Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:40.9832939Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:40.9833292Z env: 2025-05-07T20:28:40.9833527Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:40.9833832Z BUILD_ENV: build_binary 2025-05-07T20:28:40.9834087Z BUILD_TARGET: genai 2025-05-07T20:28:40.9834327Z BUILD_VARIANT: cuda 2025-05-07T20:28:40.9834571Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:40.9834830Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:40.9835141Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:40.9835487Z ##[endgroup] 2025-05-07T20:28:41.3225628Z ################################################################################ 2025-05-07T20:28:41.3226151Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:41.3226520Z # 2025-05-07T20:28:41.3243274Z # [2025-05-07T20:28:41.323Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3244244Z ################################################################################ 2025-05-07T20:28:41.3244460Z 2025-05-07T20:28:41.3244840Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3245765Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3246109Z 2025-05-07T20:28:41.3365181Z 8ba3834acd41ae3bcccd6bc3808c6265641c1772 fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3367906Z 2025-05-07T20:28:41.3368387Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3368743Z 2025-05-07T20:28:41.3503510Z f663bfaa41e2d494994aba32b6056b326d1d4de603cd7405849022b0c68c5a6f fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3506029Z 2025-05-07T20:28:41.3506522Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3506876Z 2025-05-07T20:28:41.3735314Z b3c9062203e47ff2273663b0f7d0fbee fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.3737959Z 2025-05-07T20:28:41.3747701Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:41.3770100Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.1094953Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp310-cp310-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.1095931Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:44.1096797Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.1097254Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.1097534Z 2025-05-07T20:28:51.0088645Z ################################################################################ 2025-05-07T20:28:51.0089070Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.0089468Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:51.0089906Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:51.0090223Z [CHECK]
2025-05-07T20:28:51.0090559Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:51.0091102Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:28:51.0091527Z ################################################################################
2025-05-07T20:28:51.0091739Z
2025-05-07T20:28:51.0091859Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:54.9418621Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:58.8634798Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7912256Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:02.7915430Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.5766382Z ################################################################################
2025-05-07T20:29:14.5768871Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.5769479Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.5769947Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.5770316Z ################################################################################
2025-05-07T20:29:14.5770535Z
2025-05-07T20:29:22.4328151Z ################################################################################
2025-05-07T20:29:22.4328564Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.4329963Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.4331908Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.4332430Z ################################################################################
2025-05-07T20:29:22.4332656Z
2025-05-07T20:29:22.4332815Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.3632913Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2915413Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.3705737Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.3008407Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.3012694Z [INSTALL] Check for operator registrations ...
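Note: the registration check that follows amounts to looking each operator name up on torch.ops after importing fbgemm_gpu. A minimal reproduction is sketched here (an assumption about the check's shape, not the exact setup_env.bash helper):

    conda run -n build_binary python -c "
    import torch
    import fbgemm_gpu  # importing fbgemm_gpu loads the shared libraries that register the ops
    # attribute lookup on torch.ops raises if the operator was never registered
    print(torch.ops.fbgemm.nccl_init)
    "
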
2025-05-07T20:29:42.1686239Z fbgemm.nccl_init 2025-05-07T20:29:42.1686425Z 2025-05-07T20:29:42.2299613Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:46.0877235Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:46.0877528Z 2025-05-07T20:29:46.1494019Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.0223229Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.0223546Z 2025-05-07T20:29:50.0838844Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:50.0839480Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:50.0874662Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.0875135Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:50.0887379Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:50.0887734Z env: 2025-05-07T20:29:50.0887965Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:50.0888268Z BUILD_ENV: build_binary 2025-05-07T20:29:50.0888520Z BUILD_TARGET: genai 2025-05-07T20:29:50.0888756Z BUILD_VARIANT: cuda 2025-05-07T20:29:50.0888990Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:50.0889251Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:50.0889559Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:50.0889889Z ##[endgroup] 2025-05-07T20:29:50.4233060Z ################################################################################ 2025-05-07T20:29:50.4233442Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.4233699Z # 2025-05-07T20:29:50.4248412Z # [2025-05-07T20:29:50.424Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.4248823Z ################################################################################ 2025-05-07T20:29:50.4249049Z 2025-05-07T20:29:58.2711909Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:58.2712689Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:58.2713099Z [TEST] Determined the test directories: 2025-05-07T20:29:58.2713424Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:58.2713729Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:58.2714038Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:58.2714238Z 2025-05-07T20:29:58.2722013Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:58.2729200Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:58.2729817Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:58.2730226Z 2025-05-07T20:29:58.6959671Z 2025-05-07T20:29:58.6960075Z [TEST] Installing PyTest ... 
2025-05-07T20:29:58.6982416Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:29:59.8027845Z Channels:
2025-05-07T20:29:59.8028101Z  - conda-forge
2025-05-07T20:29:59.8028336Z Platform: linux-64
2025-05-07T20:30:03.1365898Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:04.2789409Z Solving environment: done
2025-05-07T20:30:04.5055116Z
2025-05-07T20:30:04.5055514Z ## Package Plan ##
2025-05-07T20:30:04.5056326Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:04.5056886Z   added / updated specs:
2025-05-07T20:30:04.5057211Z     - expecttest
2025-05-07T20:30:04.5057430Z     - pytest
2025-05-07T20:30:04.5057694Z The following packages will be downloaded:
2025-05-07T20:30:04.5058060Z     package                    |            build
2025-05-07T20:30:04.5058492Z     ---------------------------|-----------------
2025-05-07T20:30:04.5058927Z     colorama-0.4.6             |    pyhd8ed1ab_1        26 KB  conda-forge
2025-05-07T20:30:04.5059608Z     exceptiongroup-1.2.2       |    pyhd8ed1ab_1        20 KB  conda-forge
2025-05-07T20:30:04.5060249Z     expecttest-0.3.0           |    pyhd8ed1ab_0        14 KB  conda-forge
2025-05-07T20:30:04.5060701Z     iniconfig-2.0.0            |    pyhd8ed1ab_1        11 KB  conda-forge
2025-05-07T20:30:04.5061136Z     packaging-25.0             |    pyh29332c3_1        61 KB  conda-forge
2025-05-07T20:30:04.5061565Z     pluggy-1.5.0               |    pyhd8ed1ab_1        23 KB  conda-forge
2025-05-07T20:30:04.5061984Z     pytest-8.3.5               |    pyhd8ed1ab_0       254 KB  conda-forge
2025-05-07T20:30:04.5062761Z     tomli-2.2.1                |    pyhd8ed1ab_1        19 KB  conda-forge
2025-05-07T20:30:04.5063158Z     ------------------------------------------------------------
2025-05-07T20:30:04.5063502Z                                            Total:       428 KB
2025-05-07T20:30:04.5063856Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:04.5064298Z   colorama        conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:04.5064813Z   exceptiongroup  conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:04.5065342Z   expecttest      conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:04.5065829Z   iniconfig       conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:04.5066302Z   packaging       conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:04.5066929Z   pluggy          conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:04.5067574Z   pytest          conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:04.5068165Z   tomli           conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:04.5068595Z Downloading and Extracting Packages: all 8 packages reached 100% (interactive progress-bar output elided) ... done
2025-05-07T20:30:05.0715475Z Preparing transaction: done
2025-05-07T20:30:05.1720852Z Verifying transaction: done
2025-05-07T20:30:07.0749049Z Executing transaction: done
2025-05-07T20:30:07.2014502Z [TEST] Checking imports ...
2025-05-07T20:30:11.1154575Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:11.1168605Z [TEST] Setting feature flags ...
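Note: the command that follows persists the flag with conda env config vars so it is exported into the environment and visible to every subsequent conda run subprocess, rather than only to the current job shell. One way to confirm a flag took effect (a sketch):

    conda run -n build_binary python -c \
      'import os; print(os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD"))'   # expect: 1
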
2025-05-07T20:30:11.1169375Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:11.1169766Z 2025-05-07T20:30:11.5374854Z 2025-05-07T20:30:11.5375180Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:11.5377067Z ################################################################################ 2025-05-07T20:30:11.5377811Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:11.5378397Z # 2025-05-07T20:30:11.5395978Z # [2025-05-07T20:30:11.539Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:11.5396521Z ################################################################################ 2025-05-07T20:30:11.5396979Z 2025-05-07T20:30:11.5403774Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:11.5432405Z ./attention/gqa_test.py 2025-05-07T20:30:11.5432809Z ./coalesce/coalesce_test.py 2025-05-07T20:30:11.5433199Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:11.5433629Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:11.5434007Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:11.5434263Z ./moe/activation_test.py 2025-05-07T20:30:11.5434621Z ./moe/gather_scatter_test.py 2025-05-07T20:30:11.5435000Z ./moe/layers_test.py 2025-05-07T20:30:11.5435327Z ./moe/shuffling_test.py 2025-05-07T20:30:11.5435679Z ./quantize/quantize_test.py 2025-05-07T20:30:11.5435919Z 2025-05-07T20:30:11.5436081Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:11.5436301Z 2025-05-07T20:30:11.5453404Z ################################################################################ 2025-05-07T20:30:11.5468671Z # [2025-05-07T20:30:11.546Z] Run Python Test Suite: 2025-05-07T20:30:11.5469094Z # ./attention/gqa_test.py 2025-05-07T20:30:11.5469416Z ################################################################################ 2025-05-07T20:30:11.5492734Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:11.5493357Z 2025-05-07T20:30:14.0870861Z ============================= test session starts ============================== 2025-05-07T20:30:14.0871524Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:14.0872073Z cachedir: .pytest_cache 2025-05-07T20:30:14.0872717Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:14.0873815Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:14.0874243Z plugins: hypothesis-6.131.14 2025-05-07T20:30:15.6206672Z collecting ... 
collected 2 items

2025-05-07T20:30:53.9325215Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:53.9410685Z PASSED
2025-05-07T20:30:53.9677860Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
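Note: the "Trying example" listings above are Hypothesis's Verbosity.verbose output under the derandomized 'ci' profile shown in the session header. A minimal sketch of the pattern that produces this output is below; the class name, body, and strategy bounds are illustrative stand-ins, not FBGEMM's actual test.

import unittest

from hypothesis import Verbosity, given, settings, strategies as st

class Int4GQATestSketch(unittest.TestCase):
    # Each @given parameter is drawn per example; with Verbosity.verbose,
    # Hypothesis prints one "Trying example: ..." block per drawn example.
    @given(
        int4_kv=st.booleans(),
        num_groups=st.sampled_from([1, 4]),
        B=st.integers(min_value=1, max_value=128),
        MAX_T=st.integers(min_value=1, max_value=128),
        N_H_L=st.integers(min_value=1, max_value=128),
    )
    @settings(verbosity=Verbosity.verbose, derandomize=True, deadline=None)
    def test_gqa(self, int4_kv: bool, num_groups: int, B: int, MAX_T: int, N_H_L: int) -> None:
        # Placeholder assertion; the real test checks the GQA kernel output.
        self.assertGreaterEqual(B, 1)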
2025-05-07T20:30:53.9678372Z =========================== short test summary info ============================
2025-05-07T20:30:53.9679097Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when CUDA is not available or xformers is not available
2025-05-07T20:30:53.9679800Z ======================== 1 passed, 1 skipped in 40.38s =========================
2025-05-07T20:30:54.6108334Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:30:54.6130129Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
2025-05-07T20:30:54.6150818Z ################################################################################
2025-05-07T20:30:54.6166328Z # [2025-05-07T20:30:54.616Z] Run Python Test Suite:
2025-05-07T20:30:54.6166808Z # ./coalesce/coalesce_test.py
2025-05-07T20:30:54.6167117Z ################################################################################
2025-05-07T20:30:54.6192372Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:30:56.7637852Z ============================= test session starts ==============================
2025-05-07T20:30:56.7638599Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:56.7639137Z cachedir: .pytest_cache
2025-05-07T20:30:56.7639736Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:56.7640476Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:56.7640885Z plugins: hypothesis-6.131.14
2025-05-07T20:30:58.3250053Z collecting ...
collected 1 item
2025-05-07T20:30:59.0547702Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:30:59.0548214Z ============================== 1 passed in 2.41s ===============================
2025-05-07T20:30:59.6787189Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:30:59.6807891Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:30:59.6828929Z ################################################################################
2025-05-07T20:30:59.6844807Z # [2025-05-07T20:30:59.684Z] Run Python Test Suite:
2025-05-07T20:30:59.6845301Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:30:59.6845713Z ################################################################################
2025-05-07T20:30:59.6869583Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:01.8244466Z ============================= test session starts ==============================
2025-05-07T20:31:01.8246091Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:01.8247159Z cachedir: .pytest_cache
2025-05-07T20:31:01.8248368Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:01.8249846Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:01.8250675Z plugins: hypothesis-6.131.14
2025-05-07T20:31:03.4167253Z collecting ...
collected 5 items
2025-05-07T20:31:03.4178376Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:03.4186661Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:03.4194157Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:03.4201706Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:03.4217098Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:03.4218205Z =========================== short test summary info ============================
2025-05-07T20:31:03.4218914Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4219870Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4220815Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4221756Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4222692Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:03.4223361Z ============================== 5 skipped in 1.72s ==============================
2025-05-07T20:31:03.9808788Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:03.9828659Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds
2025-05-07T20:31:03.9850793Z ################################################################################
2025-05-07T20:31:03.9867002Z # [2025-05-07T20:31:03.986Z] Run Python Test Suite:
2025-05-07T20:31:03.9867508Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:03.9867957Z ################################################################################
2025-05-07T20:31:03.9891595Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:06.1351974Z ============================= test session starts ==============================
2025-05-07T20:31:06.1353663Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:06.1355288Z cachedir: .pytest_cache
2025-05-07T20:31:06.1356351Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:06.1357100Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:06.1357524Z plugins: hypothesis-6.131.14
2025-05-07T20:31:07.7911141Z collecting ...
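All five CAR tests above were skipped because this single-GPU runner cannot satisfy their two-GPU requirement. The skip messages imply a gate along these lines (a sketch; the helper name and exact wiring are assumed, not FBGEMM's actual code):

import unittest

import torch

def _has_enough_gpus(required: int = 2) -> bool:
    # The CAR collectives need peer devices to exchange data with,
    # hence the two-GPU minimum quoted in the skip message.
    return torch.cuda.is_available() and torch.cuda.device_count() >= required

@unittest.skipIf(
    not _has_enough_gpus(),
    "Skip when CUDA is not available or when there are not enough GPUs; "
    "these tests require at least two GPUs",
)
class LLamaMultiGpuTestsSketch(unittest.TestCase):
    def test_allreduce(self) -> None:
        ...  # the real test spawns one rank per GPU and validates the reduction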
collected 2 items
2025-05-07T20:31:07.7922884Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:07.7937318Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:07.7938242Z =========================== short test summary info ============================
2025-05-07T20:31:07.7938899Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:07.7939782Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:07.7940394Z ============================== 2 skipped in 1.78s ==============================
2025-05-07T20:31:08.3639717Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:08.3661020Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds
2025-05-07T20:31:08.3683127Z ################################################################################
2025-05-07T20:31:08.3698936Z # [2025-05-07T20:31:08.369Z] Run Python Test Suite:
2025-05-07T20:31:08.3699517Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:08.3699818Z ################################################################################
2025-05-07T20:31:08.3723931Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:10.5099718Z ============================= test session starts ==============================
2025-05-07T20:31:10.5100526Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:10.5101068Z cachedir: .pytest_cache
2025-05-07T20:31:10.5101665Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:10.5102406Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:10.5102827Z plugins: hypothesis-6.131.14
2025-05-07T20:31:12.0900145Z collecting ... collected 4 items
2025-05-07T20:31:14.9201325Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:31:14.9334395Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:14.9491614Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:14.9625083Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:14.9625635Z =========================== short test summary info ============================
2025-05-07T20:31:14.9626363Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:14.9627305Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/unittest/case.py:117: Skip when xformers is not available
2025-05-07T20:31:14.9628003Z ============================== 4 skipped in 4.58s ==============================
2025-05-07T20:31:16.8560944Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:16.8582667Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:16.8603162Z ################################################################################
2025-05-07T20:31:16.8618318Z # [2025-05-07T20:31:16.861Z] Run Python Test Suite:
2025-05-07T20:31:16.8618976Z # ./moe/activation_test.py
2025-05-07T20:31:16.8619307Z ################################################################################
2025-05-07T20:31:16.8645786Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:19.0025716Z ============================= test session starts ==============================
2025-05-07T20:31:19.0026377Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:19.0026922Z cachedir: .pytest_cache
2025-05-07T20:31:19.0027512Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:19.0028255Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:19.0028669Z plugins: hypothesis-6.131.14
2025-05-07T20:31:20.6468687Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:20.8247767Z collecting ...
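The gather_scatter and kv_cache skips above are architecture gates rather than failures: those suites want Hopper-class (H100) or MI300 hardware, while this g5.4xlarge runner carries an NVIDIA A10G. A rough sketch of such a gate is below (illustrative; FBGEMM's real helpers may differ):

import unittest

import torch

def _is_hopper() -> bool:
    # H100 (Hopper) reports CUDA compute capability (9, 0); the A10G on
    # this runner reports (8, 6), so the check fails here and tests skip.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)

@unittest.skipIf(
    not _is_hopper(),
    "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
)
class GatherScatterTestsSketch(unittest.TestCase):
    def test_gather_along_first_dim(self) -> None:
        ...  # the real test exercises the Hopper-only gather kernel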
collected 2 items

2025-05-07T20:31:26.2701523Z moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:26.2784608Z PASSED
2025-05-07T20:31:26.3366993Z W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
W0507 20:31:26.335000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the same warning and traceback were logged three more times, at 20:31:26.353000, 20:31:26.394000, and 20:31:26.399000; identical output elided]
2025-05-07T20:31:26.8484592Z moe/activation_test.py::ActivationTests::test_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:26.8519095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.8519499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:26.8519733Z 2025-05-07T20:31:26.8519946Z self = 2025-05-07T20:31:26.8521050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:26.8522471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d8f4af0>} 2025-05-07T20:31:26.8523850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:26.8524901Z context = 2025-05-07T20:31:26.8525194Z 2025-05-07T20:31:26.8525372Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:26.8525906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.8526380Z module_map=module_map) 2025-05-07T20:31:26.8526759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.8527128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:26.8527403Z E ^ 2025-05-07T20:31:26.8527881Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.8528458Z 2025-05-07T20:31:26.8528893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:26.8529416Z 2025-05-07T20:31:26.8529536Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:26.8529958Z self=, 2025-05-07T20:31:26.8530370Z T=2048, 2025-05-07T20:31:26.8530564Z D=5120, 2025-05-07T20:31:26.8530761Z scale_ub=1200.0, 2025-05-07T20:31:26.8531002Z contiguous=True, 2025-05-07T20:31:26.8531274Z compiled=False, 2025-05-07T20:31:26.8531481Z ) 2025-05-07T20:31:27.3899048Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:27.3900158Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:27.3901599Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:27.3903063Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:27.3904483Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:27.3906075Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:27.3907430Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:27.3909038Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:27.3910601Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:27.3911880Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:27.3913123Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:27.3914360Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:27.3915411Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:27.3916447Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:27.3917691Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:27.3918993Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:27.3920292Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:27.3921398Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:27.3922599Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:27.3923983Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:27.3925060Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:27.3925995Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:27.3926742Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:27.3927777Z W0507 20:31:27.386000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
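The CompilationError above is an environment limitation rather than a test bug: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada) and 9.0 (Hopper) onward, while the A10G on a linux.g5.4xlarge runner is compute capability 8.6, where Triton offers only fp8e4b15 and fp8e5. Below is a minimal sketch, assuming unittest-style tests and a simple capability cutoff, of how a suite could skip these kernels on unsupported GPUs; the helper name, class name, and skip message are illustrative, not FBGEMM's actual gating:

import unittest

import torch


def has_native_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs native FP8 hardware support, which NVIDIA
    # ships starting with compute capability 8.9 (Ada) / 9.0 (Hopper).
    # The A10G (sm_86) driving this job fails this check.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(has_native_fp8e4nv(), "FP8 e4m3 requires compute capability >= 8.9")
class Fp8GuardedTests(unittest.TestCase):  # hypothetical test class
    ...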
2025-05-07T20:31:28.7810838Z self = 2025-05-07T20:31:28.7811583Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:28.7811921Z 2025-05-07T20:31:28.7812006Z @given( 2025-05-07T20:31:28.7812277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:28.7820669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:28.7821207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:28.7821738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:28.7822219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:28.7822629Z ) 2025-05-07T20:31:28.7823133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:28.7823763Z def test_silu_mul_quant( 2025-05-07T20:31:28.7824115Z self, 2025-05-07T20:31:28.7824403Z T: int, 2025-05-07T20:31:28.7824681Z D: int, 2025-05-07T20:31:28.7824933Z scale_ub: Optional[float], 2025-05-07T20:31:28.7825212Z contiguous: bool, 2025-05-07T20:31:28.7825453Z compiled: bool, 2025-05-07T20:31:28.7825689Z ) -> None: 2025-05-07T20:31:28.7825914Z torch.manual_seed(2025) 2025-05-07T20:31:28.7826153Z 2025-05-07T20:31:28.7826434Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:28.7826971Z 2025-05-07T20:31:28.7827170Z x_sign = torch.sign(x) 2025-05-07T20:31:28.7827471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:28.7827793Z x = x_sign * x_clamp 2025-05-07T20:31:28.7828038Z x0 = x[:, :D] 2025-05-07T20:31:28.7828252Z x1 = x[:, D:] 2025-05-07T20:31:28.7828466Z 2025-05-07T20:31:28.7828661Z if contiguous: 2025-05-07T20:31:28.7828890Z x0 = x0.contiguous() 2025-05-07T20:31:28.7829158Z x1 = x1.contiguous() 2025-05-07T20:31:28.7829406Z 2025-05-07T20:31:28.7829596Z if scale_ub is not None: 2025-05-07T20:31:28.7829878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:28.7830228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:28.7830538Z ) 2025-05-07T20:31:28.7830737Z else: 2025-05-07T20:31:28.7830954Z scale_ub_tensor = None
2025-05-07T20:31:28.7831210Z 2025-05-07T20:31:28.7831458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:28.7831783Z op = silu_mul_quant 2025-05-07T20:31:28.7832031Z if compiled: 2025-05-07T20:31:28.7832291Z op = torch.compile(op) 2025-05-07T20:31:28.7832595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:28.7832883Z 2025-05-07T20:31:28.7833078Z > y_fp8, y_scale = fn() 2025-05-07T20:31:28.7833252Z 2025-05-07T20:31:28.7833355Z moe/activation_test.py:117: 2025-05-07T20:31:28.7833663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7833996Z moe/activation_test.py:115: in fn 2025-05-07T20:31:28.7834290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:28.7835004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:28.7835708Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:28.7836264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:28.7836964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:28.7837779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:28.7838315Z kernel = self.compile( 2025-05-07T20:31:28.7838876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:28.7839547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.7839949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7840177Z 2025-05-07T20:31:28.7840388Z self = 2025-05-07T20:31:28.7841560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:28.7842972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946ef0>} 2025-05-07T20:31:28.7844356Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:28.7845399Z context = 2025-05-07T20:31:28.7845697Z 2025-05-07T20:31:28.7845867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:28.7846399Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.7846879Z module_map=module_map) 2025-05-07T20:31:28.7847330Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.7847693Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:28.7847957Z E ^ 2025-05-07T20:31:28.7848431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.7848897Z 2025-05-07T20:31:28.7849322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:28.7849848Z 2025-05-07T20:31:28.7849953Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:28.7850377Z self=, 2025-05-07T20:31:28.7850778Z T=2048, 2025-05-07T20:31:28.7850972Z D=5120, 2025-05-07T20:31:28.7851221Z scale_ub=1200.0, 2025-05-07T20:31:28.7851460Z contiguous=True, 2025-05-07T20:31:28.7851686Z compiled=True, 2025-05-07T20:31:28.7851898Z ) 2025-05-07T20:31:28.7852227Z self = 2025-05-07T20:31:28.7852732Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:28.7853016Z 2025-05-07T20:31:28.7867768Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:28.7867968Z 2025-05-07T20:31:28.7868077Z moe/activation_test.py:126: 2025-05-07T20:31:28.7868368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7868705Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:28.7869040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:28.7869837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:28.7870605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:28.7871173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:28.7871919Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:28.7872622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:28.7873357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:28.7874124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:28.7874883Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:28.7875620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:28.7876272Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:28.7876889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:28.7877412Z fn() 2025-05-07T20:31:28.7877934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:28.7878620Z self.fn.run( 2025-05-07T20:31:28.7879099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:28.7879634Z kernel = self.compile( 2025-05-07T20:31:28.7880187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:28.7880853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:28.7881247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:28.7881506Z 2025-05-07T20:31:28.7881742Z self = 2025-05-07T20:31:28.7882849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:28.7884254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53d9b4790>} 2025-05-07T20:31:28.7885619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:28.7886656Z context = 2025-05-07T20:31:28.7886952Z 2025-05-07T20:31:28.7887120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:28.7887731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:28.7888212Z module_map=module_map) 2025-05-07T20:31:28.7888578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:28.7888956Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:28.7889228Z E ^ 2025-05-07T20:31:28.7889695Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:28.7890158Z 2025-05-07T20:31:28.7890582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:28.7891106Z 2025-05-07T20:31:28.7891214Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:28.7891635Z self=, 2025-05-07T20:31:28.7892038Z T=16384, 2025-05-07T20:31:28.7892237Z D=7168, 2025-05-07T20:31:28.7892436Z scale_ub=1200.0, 2025-05-07T20:31:28.7892659Z contiguous=False, 2025-05-07T20:31:28.7892896Z compiled=False, 2025-05-07T20:31:28.7893107Z ) 2025-05-07T20:31:29.1591698Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:29.1593217Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:29.1594670Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:29.1596119Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:29.1597532Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:29.1599202Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.1600528Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:29.1601940Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.1603392Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:29.1604667Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 
2025-05-07T20:31:29.1605920Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:29.1607155Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:29.1608205Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:29.1609358Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:29.1610609Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:29.1611928Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:29.1613068Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:29.1614126Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:29.1615334Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:29.1616718Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:29.1617807Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.1618816Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:29.1619568Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:29.1620607Z W0507 20:31:29.155000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
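The W0507 traceback above is a side effect of the same failure: torch.compile lowers user Triton kernels to TTIR to work out which arguments the kernel writes to, and when that lowering itself raises, identify_mutated_tensors falls back to treating every tensor input as mutated, which is functionally safe but blocks some optimizations. A minimal sketch of the shape of that fallback, not torch's actual implementation:

from typing import Any, Callable, Dict, List

import torch


def mutated_tensor_names(
    analyze: Callable[[], List[str]],
    kwargs: Dict[str, Any],
) -> List[str]:
    # analyze() stands in for the TTIR build plus store analysis that
    # torch._higher_order_ops.triton_kernel_wrap performs. If it raises,
    # as it does here because the kernel cannot compile on this GPU,
    # conservatively report every tensor argument as mutated.
    try:
        return analyze()
    except Exception:
        return [name for name, value in kwargs.items() if isinstance(value, torch.Tensor)]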
2025-05-07T20:31:31.0835720Z self = 2025-05-07T20:31:31.0836350Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:31.0836752Z
2025-05-07T20:31:31.0847034Z 2025-05-07T20:31:31.0847281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0847606Z op = silu_mul_quant 2025-05-07T20:31:31.0847861Z if compiled: 2025-05-07T20:31:31.0848128Z op = torch.compile(op) 2025-05-07T20:31:31.0848438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0848716Z 2025-05-07T20:31:31.0848922Z > y_fp8, y_scale = fn() 2025-05-07T20:31:31.0849098Z 2025-05-07T20:31:31.0849205Z moe/activation_test.py:117: 2025-05-07T20:31:31.0849515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0849850Z moe/activation_test.py:115: in fn 2025-05-07T20:31:31.0850143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0851015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:31.0851729Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:31.0852297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.0853002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.0853685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.0854229Z kernel = self.compile( 2025-05-07T20:31:31.0854792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.0855472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.0856257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0856497Z 2025-05-07T20:31:31.0856718Z self = 2025-05-07T20:31:31.0857830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.0859363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d28d2d0>} 2025-05-07T20:31:31.0860737Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.0861788Z context = 2025-05-07T20:31:31.0862081Z 2025-05-07T20:31:31.0862253Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.0862790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.0863273Z module_map=module_map) 2025-05-07T20:31:31.0863792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.0864156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.0864424Z E ^ 2025-05-07T20:31:31.0864901Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.0865361Z 2025-05-07T20:31:31.0865790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
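For reference, the arithmetic that both failing kernels (_fbgemm_silu_mul_quant and _kernel_quantize_fp8_row) implement, reconstructed from the test's ref_fn, is a SiLU-gated product followed by row-wise FP8 quantization with an optional scale upper bound. A pure-PyTorch sketch, assuming row-wise max-abs scaling and torch.float8_e4m3fn (Triton's fp8e4nv); the clamping details are assumptions, not the exact fbgemm_gpu kernel semantics:

from typing import Optional, Tuple

import torch


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_dtype: torch.dtype = torch.float8_e4m3fn,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product in fp32, exactly as the test's ref_fn computes it.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row so each row fills the representable FP8 range;
    # applying scale_ub as a cap on the row max is an assumption.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(fp8_dtype).max
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(fp8_dtype)
    # Dequantize as y_fp8.float() * scale, matching y_scale[:, None] in the test.
    return y_fp8, scale.squeeze(-1)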
2025-05-07T20:31:31.0866323Z 2025-05-07T20:31:31.0866431Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0866858Z self=, 2025-05-07T20:31:31.0867265Z T=1, 2025-05-07T20:31:31.0867459Z D=7168, 2025-05-07T20:31:31.0867662Z scale_ub=None, 2025-05-07T20:31:31.0867890Z contiguous=True, 2025-05-07T20:31:31.0868117Z compiled=True, 2025-05-07T20:31:31.0868330Z ) 2025-05-07T20:31:31.0868666Z self = 2025-05-07T20:31:31.0869155Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:31.0869440Z 2025-05-07T20:31:31.0891873Z > y_fp8_ref,
2025-05-07T20:31:31.0866323Z 2025-05-07T20:31:31.0866431Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0866858Z self=, 2025-05-07T20:31:31.0867265Z T=1, 2025-05-07T20:31:31.0867459Z D=7168, 2025-05-07T20:31:31.0867662Z scale_ub=None, 2025-05-07T20:31:31.0867890Z contiguous=True, 2025-05-07T20:31:31.0868117Z compiled=True, 2025-05-07T20:31:31.0868330Z ) 2025-05-07T20:31:31.0868666Z self = 2025-05-07T20:31:31.0869155Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:31.0869440Z 2025-05-07T20:31:31.0869519Z @given( 2025-05-07T20:31:31.0869762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:31.0870079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:31.0878150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:31.0878522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:31.0878885Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:31.0879184Z ) 2025-05-07T20:31:31.0879546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:31.0880148Z def test_silu_mul_quant( 2025-05-07T20:31:31.0880404Z self, 2025-05-07T20:31:31.0880607Z T: int, 2025-05-07T20:31:31.0880814Z D: int, 2025-05-07T20:31:31.0881036Z scale_ub: Optional[float], 2025-05-07T20:31:31.0881319Z contiguous: bool, 2025-05-07T20:31:31.0881566Z compiled: bool, 2025-05-07T20:31:31.0881790Z ) -> None: 2025-05-07T20:31:31.0882015Z torch.manual_seed(2025) 2025-05-07T20:31:31.0882267Z 2025-05-07T20:31:31.0882548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:31.0882898Z 2025-05-07T20:31:31.0883103Z x_sign = torch.sign(x) 2025-05-07T20:31:31.0883403Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:31.0883721Z x = x_sign * x_clamp 2025-05-07T20:31:31.0883969Z x0 = x[:, :D] 2025-05-07T20:31:31.0884187Z x1 = x[:, D:] 2025-05-07T20:31:31.0884403Z 2025-05-07T20:31:31.0884597Z if contiguous: 2025-05-07T20:31:31.0884840Z x0 = x0.contiguous() 2025-05-07T20:31:31.0885108Z x1 = x1.contiguous() 2025-05-07T20:31:31.0885358Z 2025-05-07T20:31:31.0885553Z if scale_ub is not None: 2025-05-07T20:31:31.0885840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:31.0886187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:31.0886509Z ) 2025-05-07T20:31:31.0886706Z else: 2025-05-07T20:31:31.0886929Z scale_ub_tensor = None 2025-05-07T20:31:31.0887194Z 2025-05-07T20:31:31.0887431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0887759Z op = silu_mul_quant 2025-05-07T20:31:31.0888018Z if compiled: 2025-05-07T20:31:31.0888268Z op = torch.compile(op) 2025-05-07T20:31:31.0888577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:31.0888860Z 2025-05-07T20:31:31.0889054Z y_fp8, y_scale = fn() 2025-05-07T20:31:31.0889354Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:31.0889654Z 2025-05-07T20:31:31.0889897Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:31.0890343Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:31.0890646Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:31.0890972Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:31.0891339Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.0891660Z 2025-05-07T20:31:31.0891873Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:31.0892074Z 2025-05-07T20:31:31.0892178Z moe/activation_test.py:126: 2025-05-07T20:31:31.0892484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0892829Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:31.0893162Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:31.0893972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:31.0894740Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:31.0895306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:31.0895994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:31.0896698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:31.0897432Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.0898283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:31.0899038Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:31.0899863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:31.0900521Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:31.0901143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:31.0901666Z fn() 2025-05-07T20:31:31.0902234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:31.0902831Z self.fn.run( 2025-05-07T20:31:31.0903307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:31.0903849Z kernel = self.compile( 2025-05-07T20:31:31.0904401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:31.0905069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.0905475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:31.0905710Z 2025-05-07T20:31:31.0905923Z self = 2025-05-07T20:31:31.0907036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:31.0908437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': .
at 0x7fc53c0fd5a0>} 2025-05-07T20:31:31.0909799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:31.0910848Z context = 2025-05-07T20:31:31.0911150Z 2025-05-07T20:31:31.0911323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:31.0911854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.0912463Z module_map=module_map) 2025-05-07T20:31:31.0912841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.0913208Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:31.0913491Z E ^ 2025-05-07T20:31:31.0913965Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.0914430Z 2025-05-07T20:31:31.0914852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:31.0915376Z 2025-05-07T20:31:31.0915491Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:31.0915919Z self=, 2025-05-07T20:31:31.0916326Z T=4096, 2025-05-07T20:31:31.0916517Z D=5120, 2025-05-07T20:31:31.0916712Z scale_ub=None, 2025-05-07T20:31:31.0916932Z contiguous=False, 2025-05-07T20:31:31.0917163Z compiled=False, 2025-05-07T20:31:31.0917371Z ) 2025-05-07T20:31:31.6426617Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:31.6427721Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:31.6429237Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:31.6430900Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:31.6432333Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:31.6433762Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.6435114Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:31.6436535Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.6438002Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:31.6439402Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 
2025-05-07T20:31:31.6440838Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.6442081Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:31.6443151Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:31.6444198Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:31.6445655Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.6446974Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.6448114Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:31.6449180Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:31.6450386Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.6451783Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.6452862Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.6453793Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.6454550Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:31.6455972Z W0507 20:31:31.639000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
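The W0507 block above is torch.compile noise rather than the test failure itself: Dynamo's `identify_mutated_tensors` pass re-lowers the user-defined Triton kernel to TTIR to work out which arguments it writes to; when that lowering hits the same fp8e4nv error, it conservatively "assumes every input is mutated" and keeps going, so the error surfaces first as a warning (once per compile attempt) and then again as the real `CompilationError` at launch. A sketch of gating compilation on hardware support (the helper is our assumption, not an FBGEMM API), which keeps Dynamo from tracing the kernel at all on unsupported parts:

```python
import torch

def maybe_compile(op):
    # Only hand the op to torch.compile where Triton can lower fp8e4nv
    # (SM 8.9+); on older parts such as this runner's A10G (SM 8.6),
    # stay eager so the identify_mutated_tensors analysis never runs.
    # Note: the kernel itself still needs an FP8-capable device; this
    # only avoids the compile-time warnings, not the launch failure.
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
        return torch.compile(op)
    return op
```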
2025-05-07T20:31:36.2005473Z self = 2025-05-07T20:31:36.2007088Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.2007773Z 2025-05-07T20:31:36.2007950Z @given( 2025-05-07T20:31:36.2008440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2009095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2009728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2010409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2011081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2011669Z ) 2025-05-07T20:31:36.2012397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2013300Z def test_silu_mul_quant( 2025-05-07T20:31:36.2013612Z self, 2025-05-07T20:31:36.2013841Z T: int, 2025-05-07T20:31:36.2014384Z D: int, 2025-05-07T20:31:36.2014622Z scale_ub: Optional[float], 2025-05-07T20:31:36.2014908Z contiguous: bool, 2025-05-07T20:31:36.2015159Z compiled: bool, 2025-05-07T20:31:36.2015403Z ) -> None: 2025-05-07T20:31:36.2015633Z torch.manual_seed(2025) 2025-05-07T20:31:36.2015882Z 2025-05-07T20:31:36.2016172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2016527Z 2025-05-07T20:31:36.2016730Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2017028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2017351Z x = x_sign * x_clamp 2025-05-07T20:31:36.2017603Z x0 = x[:, :D] 2025-05-07T20:31:36.2017825Z x1 = x[:, D:] 2025-05-07T20:31:36.2018116Z 2025-05-07T20:31:36.2018316Z if contiguous: 2025-05-07T20:31:36.2018556Z x0 = x0.contiguous() 2025-05-07T20:31:36.2018826Z x1 = x1.contiguous() 2025-05-07T20:31:36.2019085Z 2025-05-07T20:31:36.2019288Z if scale_ub is not None: 2025-05-07T20:31:36.2019574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2019932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2020245Z ) 2025-05-07T20:31:36.2020448Z else: 2025-05-07T20:31:36.2020668Z scale_ub_tensor = None
2025-05-07T20:31:36.2020926Z 2025-05-07T20:31:36.2021172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2021498Z op = silu_mul_quant 2025-05-07T20:31:36.2021760Z if compiled: 2025-05-07T20:31:36.2022014Z op = torch.compile(op) 2025-05-07T20:31:36.2022324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2022612Z 2025-05-07T20:31:36.2022812Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.2022988Z 2025-05-07T20:31:36.2023100Z moe/activation_test.py:117: 2025-05-07T20:31:36.2023413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2023749Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.2024044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2024932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.2025660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.2026220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2026929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2027619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2028167Z kernel = self.compile( 2025-05-07T20:31:36.2028730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2029418Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2029831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2030069Z 2025-05-07T20:31:36.2030285Z self = 2025-05-07T20:31:36.2031406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2032844Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53f59e7a0>} 2025-05-07T20:31:36.2034276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2035417Z context = 2025-05-07T20:31:36.2035714Z 2025-05-07T20:31:36.2035885Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2036431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2036920Z module_map=module_map) 2025-05-07T20:31:36.2037291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2037655Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.2037922Z E ^ 2025-05-07T20:31:36.2038395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2038864Z 2025-05-07T20:31:36.2039292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2039822Z 2025-05-07T20:31:36.2039937Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2040363Z self=, 2025-05-07T20:31:36.2040767Z T=4096, 2025-05-07T20:31:36.2040974Z D=7168, 2025-05-07T20:31:36.2041172Z scale_ub=None, 2025-05-07T20:31:36.2041388Z contiguous=False, 2025-05-07T20:31:36.2041621Z compiled=False, 2025-05-07T20:31:36.2041838Z ) 2025-05-07T20:31:36.2042166Z self = 2025-05-07T20:31:36.2042670Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.2042953Z 2025-05-07T20:31:36.2043030Z @given( 2025-05-07T20:31:36.2043277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2043593Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2043909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2044250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2044593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2044886Z ) 2025-05-07T20:31:36.2045251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2045793Z def test_silu_mul_quant( 2025-05-07T20:31:36.2046036Z self, 2025-05-07T20:31:36.2046236Z T: int, 2025-05-07T20:31:36.2046441Z D: int, 2025-05-07T20:31:36.2046662Z scale_ub: Optional[float], 2025-05-07T20:31:36.2046941Z contiguous: bool, 2025-05-07T20:31:36.2047187Z compiled: bool, 2025-05-07T20:31:36.2047412Z ) -> None: 2025-05-07T20:31:36.2047636Z torch.manual_seed(2025) 2025-05-07T20:31:36.2047887Z 2025-05-07T20:31:36.2048162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2048517Z 2025-05-07T20:31:36.2048719Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2049015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2049340Z x = x_sign * x_clamp 2025-05-07T20:31:36.2049588Z x0 = x[:, :D] 2025-05-07T20:31:36.2049807Z x1 = x[:, D:] 2025-05-07T20:31:36.2050022Z 2025-05-07T20:31:36.2050221Z if contiguous: 2025-05-07T20:31:36.2050454Z x0 = x0.contiguous() 2025-05-07T20:31:36.2050724Z x1 = x1.contiguous() 2025-05-07T20:31:36.2050974Z 2025-05-07T20:31:36.2051174Z if scale_ub is not None: 2025-05-07T20:31:36.2051452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2051799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2052114Z ) 2025-05-07T20:31:36.2052305Z else: 2025-05-07T20:31:36.2052520Z scale_ub_tensor = None 2025-05-07T20:31:36.2052782Z 2025-05-07T20:31:36.2053016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2053337Z op = silu_mul_quant 2025-05-07T20:31:36.2053594Z if compiled: 2025-05-07T20:31:36.2053929Z op = torch.compile(op) 2025-05-07T20:31:36.2054238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2054519Z 2025-05-07T20:31:36.2054712Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.2054891Z 2025-05-07T20:31:36.2054994Z moe/activation_test.py:117: 2025-05-07T20:31:36.2055296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2055828Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.2056116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2056827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.2057546Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.2058157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2058858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2059548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2060095Z kernel = self.compile( 2025-05-07T20:31:36.2060651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2061330Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2061737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2061968Z 2025-05-07T20:31:36.2062187Z self = 2025-05-07T20:31:36.2063294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2064714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946cb0>} 2025-05-07T20:31:36.2066104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2067291Z context = 2025-05-07T20:31:36.2067586Z 2025-05-07T20:31:36.2067757Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2068300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2068783Z module_map=module_map) 2025-05-07T20:31:36.2069164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2069525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.2069794Z E ^ 2025-05-07T20:31:36.2070283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2070746Z 2025-05-07T20:31:36.2071176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2071715Z 2025-05-07T20:31:36.2071823Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2072254Z self=, 2025-05-07T20:31:36.2072671Z T=128, 2025-05-07T20:31:36.2072863Z D=7168, 2025-05-07T20:31:36.2073071Z scale_ub=None, 2025-05-07T20:31:36.2073324Z contiguous=False, 2025-05-07T20:31:36.2073582Z compiled=True, 2025-05-07T20:31:36.2073802Z ) 2025-05-07T20:31:36.2741463Z self = 2025-05-07T20:31:36.2742217Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:36.2742636Z 2025-05-07T20:31:36.2742849Z @given( 2025-05-07T20:31:36.2743826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.2756021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.2756398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.2756748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.2757090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.2757379Z ) 2025-05-07T20:31:36.2757743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.2758196Z def test_silu_mul_quant( 2025-05-07T20:31:36.2758440Z self, 2025-05-07T20:31:36.2758642Z T: int, 2025-05-07T20:31:36.2758843Z D: int, 2025-05-07T20:31:36.2759061Z scale_ub: Optional[float], 2025-05-07T20:31:36.2759342Z contiguous: bool, 2025-05-07T20:31:36.2759592Z compiled: bool, 2025-05-07T20:31:36.2759818Z ) -> None: 2025-05-07T20:31:36.2760045Z torch.manual_seed(2025) 2025-05-07T20:31:36.2760309Z 2025-05-07T20:31:36.2760585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.2760936Z 2025-05-07T20:31:36.2761145Z x_sign = torch.sign(x) 2025-05-07T20:31:36.2761443Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.2761751Z x = x_sign * x_clamp 2025-05-07T20:31:36.2761998Z x0 = x[:, :D] 2025-05-07T20:31:36.2762217Z x1 = x[:, D:] 2025-05-07T20:31:36.2762425Z 2025-05-07T20:31:36.2762618Z if contiguous: 2025-05-07T20:31:36.2762852Z x0 = x0.contiguous() 2025-05-07T20:31:36.2763108Z x1 = x1.contiguous() 2025-05-07T20:31:36.2763376Z 2025-05-07T20:31:36.2763597Z if scale_ub is not None: 2025-05-07T20:31:36.2763868Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.2764212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.2764528Z ) 2025-05-07T20:31:36.2764717Z else: 2025-05-07T20:31:36.2764940Z scale_ub_tensor = None 2025-05-07T20:31:36.2765203Z 2025-05-07T20:31:36.2765437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2765939Z op = silu_mul_quant 2025-05-07T20:31:36.2766196Z if compiled: 2025-05-07T20:31:36.2766451Z op = torch.compile(op) 2025-05-07T20:31:36.2766751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.2767032Z 2025-05-07T20:31:36.2767232Z y_fp8, y_scale = fn() 2025-05-07T20:31:36.2767520Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:36.2767817Z 2025-05-07T20:31:36.2768062Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.2768395Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:36.2768696Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:36.2769019Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:36.2769385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.2769702Z 2025-05-07T20:31:36.2769914Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:36.2770114Z 2025-05-07T20:31:36.2770230Z moe/activation_test.py:126: 2025-05-07T20:31:36.2770526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2770867Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:36.2771206Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:36.2772003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:36.2772767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:36.2773328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.2774015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.2774841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:36.2775579Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.2776348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:36.2777103Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:36.2777842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:36.2778574Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:36.2779187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:36.2779708Z fn() 2025-05-07T20:31:36.2780231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:36.2780821Z self.fn.run( 2025-05-07T20:31:36.2781304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.2781859Z kernel = self.compile( 2025-05-07T20:31:36.2782417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.2783090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.2783516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.2783778Z 2025-05-07T20:31:36.2783990Z self = 2025-05-07T20:31:36.2785101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.2786518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53e480d30>} 2025-05-07T20:31:36.2788252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.2789308Z context = 2025-05-07T20:31:36.2789608Z 2025-05-07T20:31:36.2789777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.2790310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.2790780Z module_map=module_map) 2025-05-07T20:31:36.2791157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.2791530Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:36.2791804Z E ^ 2025-05-07T20:31:36.2792272Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.2792743Z 2025-05-07T20:31:36.2793254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.2793829Z 2025-05-07T20:31:36.2793936Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.2794360Z self=, 2025-05-07T20:31:36.2794762Z T=128, 2025-05-07T20:31:36.2794956Z D=7168, 2025-05-07T20:31:36.2795151Z scale_ub=None, 2025-05-07T20:31:36.2795367Z contiguous=False, 2025-05-07T20:31:36.2795600Z compiled=False, 2025-05-07T20:31:36.2795810Z ) 2025-05-07T20:31:36.4889791Z self = 2025-05-07T20:31:36.4890506Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:36.4890817Z 2025-05-07T20:31:36.4890901Z @given( 2025-05-07T20:31:36.4891143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4891475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4891784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4892125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4892467Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4892756Z ) 2025-05-07T20:31:36.4893123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4893611Z def test_silu_mul_quant( 2025-05-07T20:31:36.4893882Z self, 2025-05-07T20:31:36.4894084Z T: int, 2025-05-07T20:31:36.4894289Z D: int, 2025-05-07T20:31:36.4894510Z scale_ub: Optional[float], 2025-05-07T20:31:36.4894794Z contiguous: bool, 2025-05-07T20:31:36.4895042Z compiled: bool, 2025-05-07T20:31:36.4895274Z ) -> None: 2025-05-07T20:31:36.4895500Z torch.manual_seed(2025) 2025-05-07T20:31:36.4895756Z 2025-05-07T20:31:36.4896038Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4896388Z 2025-05-07T20:31:36.4896589Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4896890Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4897202Z x = x_sign * x_clamp 2025-05-07T20:31:36.4897448Z x0 = x[:, :D] 2025-05-07T20:31:36.4897673Z x1 = x[:, D:] 2025-05-07T20:31:36.4897886Z 2025-05-07T20:31:36.4898158Z if contiguous: 2025-05-07T20:31:36.4898400Z x0 = x0.contiguous() 2025-05-07T20:31:36.4898661Z x1 = x1.contiguous() 2025-05-07T20:31:36.4898911Z 2025-05-07T20:31:36.4899119Z if scale_ub is not None: 2025-05-07T20:31:36.4899395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.4899748Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4900062Z ) 2025-05-07T20:31:36.4900256Z else: 2025-05-07T20:31:36.4900474Z scale_ub_tensor = None 2025-05-07T20:31:36.4900871Z 2025-05-07T20:31:36.4901108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4901433Z op = silu_mul_quant 2025-05-07T20:31:36.4901691Z if compiled: 
2025-05-07T20:31:36.4901949Z op = torch.compile(op) 2025-05-07T20:31:36.4902249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4902535Z 2025-05-07T20:31:36.4902740Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4902911Z 2025-05-07T20:31:36.4903015Z moe/activation_test.py:117: 2025-05-07T20:31:36.4903319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4903664Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4903950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4904665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4905379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4905938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4906632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4907315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4907866Z kernel = self.compile( 2025-05-07T20:31:36.4908421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4909100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4909511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4909740Z 2025-05-07T20:31:36.4910047Z self = 2025-05-07T20:31:36.4911151Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4912559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5175b0430>} 2025-05-07T20:31:36.4913979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4915022Z context = 2025-05-07T20:31:36.4915314Z 2025-05-07T20:31:36.4915495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4916027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4916512Z module_map=module_map) 2025-05-07T20:31:36.4916895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4917256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4917526Z E ^ 2025-05-07T20:31:36.4918005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:36.4918465Z 2025-05-07T20:31:36.4918896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:36.4919417Z 2025-05-07T20:31:36.4919527Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:36.4919952Z self=, 2025-05-07T20:31:36.4920365Z T=4096, 2025-05-07T20:31:36.4920557Z D=5120, 2025-05-07T20:31:36.4920764Z scale_ub=1200.0, 2025-05-07T20:31:36.4920999Z contiguous=True, 2025-05-07T20:31:36.4921234Z compiled=False, 2025-05-07T20:31:36.4921528Z ) 2025-05-07T20:31:36.4921859Z self = 2025-05-07T20:31:36.4922373Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:36.4922650Z 2025-05-07T20:31:36.4922736Z @given( 2025-05-07T20:31:36.4922972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:36.4923298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:36.4923671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:36.4924034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:36.4924373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:36.4924662Z ) 2025-05-07T20:31:36.4925022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:36.4925483Z def test_silu_mul_quant( 2025-05-07T20:31:36.4925731Z self, 2025-05-07T20:31:36.4925932Z T: int, 2025-05-07T20:31:36.4926142Z D: int, 2025-05-07T20:31:36.4926370Z scale_ub: Optional[float], 2025-05-07T20:31:36.4926654Z contiguous: bool, 2025-05-07T20:31:36.4926905Z compiled: bool, 2025-05-07T20:31:36.4927129Z ) -> None: 2025-05-07T20:31:36.4927357Z torch.manual_seed(2025) 2025-05-07T20:31:36.4927610Z 2025-05-07T20:31:36.4927885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:36.4928234Z 2025-05-07T20:31:36.4928438Z x_sign = torch.sign(x) 2025-05-07T20:31:36.4928731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:36.4929049Z x = x_sign * x_clamp 2025-05-07T20:31:36.4929297Z x0 = x[:, :D] 2025-05-07T20:31:36.4929524Z x1 = x[:, D:] 2025-05-07T20:31:36.4929729Z 2025-05-07T20:31:36.4929923Z if contiguous: 2025-05-07T20:31:36.4930273Z x0 = x0.contiguous() 2025-05-07T20:31:36.4930535Z x1 = x1.contiguous() 2025-05-07T20:31:36.4930778Z 2025-05-07T20:31:36.4930982Z if scale_ub is not None: 2025-05-07T20:31:36.4931259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:36.4931598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:36.4931907Z ) 2025-05-07T20:31:36.4932103Z else: 2025-05-07T20:31:36.4932319Z scale_ub_tensor = None 2025-05-07T20:31:36.4932576Z 2025-05-07T20:31:36.4932813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:36.4933133Z op = silu_mul_quant 2025-05-07T20:31:36.4933387Z if compiled: 2025-05-07T20:31:36.4933634Z op = torch.compile(op) 2025-05-07T20:31:36.4933936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4934211Z 2025-05-07T20:31:36.4934405Z > y_fp8, y_scale = fn() 2025-05-07T20:31:36.4934578Z 2025-05-07T20:31:36.4934684Z moe/activation_test.py:117: 2025-05-07T20:31:36.4934980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4935316Z moe/activation_test.py:115: in fn 2025-05-07T20:31:36.4935598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:36.4936303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:36.4937007Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:36.4937548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:36.4938295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:36.4938973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:36.4939518Z kernel = self.compile( 2025-05-07T20:31:36.4940072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:36.4940742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:36.4941233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:36.4941460Z 2025-05-07T20:31:36.4941681Z self = 2025-05-07T20:31:36.4942769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:36.4944211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc52c335b40>} 2025-05-07T20:31:36.4945586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:36.4946626Z context = 2025-05-07T20:31:36.4946924Z 2025-05-07T20:31:36.4947092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:36.4947621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:36.4948100Z module_map=module_map) 2025-05-07T20:31:36.4948473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:36.4948835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:36.4949095Z E ^ 2025-05-07T20:31:36.4949573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:36.4950533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:36.4951164Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:36.9647791Z W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
W0507 20:31:36.961000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[the same [1/4] warning and traceback were emitted three more times for this example, at 20:31:37.123, 20:31:37.569, and 20:31:37.599]
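The warning above comes from torch.compile's Triton-kernel wrapper: identify_mutated_tensors tries to lower the kernel to TTIR to work out which arguments are written to, that lowering hits the same fp8e4nv error as the kernel itself, and the wrapper falls back to assuming every input is mutated. The root complaint is hardware support: Triton's fp8e4nv is the type that torch.float8_e4m3fn lowers to, and the error text says this GPU only offers fp8e4b15 and fp8e5 (the linux.g5.4xlarge runner's NVIDIA A10G reports compute capability (8, 6)). A minimal sketch of probing for support before touching the FP8 path; the helper name and the (8, 9) Ada/Hopper cutoff are inferences from the error text, not something this log states:

```python
import torch

def fp8_e4m3_supported() -> bool:
    """Best-effort probe: can this GPU run kernels that use torch.float8_e4m3fn
    (Triton's fp8e4nv)? Assumes the sm_89+ cutoff implied by the error above."""
    if not torch.cuda.is_available():
        return False
    # Ada (sm_89) and Hopper (sm_90) have native E4M3 support; an A10G is sm_86.
    return torch.cuda.get_device_capability() >= (8, 9)
```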
2025-05-07T20:31:37.9108137Z self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc517344820>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
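The failing reference path is the clearest summary of what the test needs: ref_fn computes SiLU(x0) * x1 in fp32 and hands the result to triton_quantize_fp8_row, whose autotuned _kernel_quantize_fp8_row then fails to compile for this architecture. For orientation only, the sketch below is an eager-mode stand-in for the assumed row-wise recipe: take each row's max magnitude, optionally cap it with scale_ub, derive a scale that fits the row into float8_e4m3fn's finite range, and return quantized rows plus per-row dequantization scales. The function name, the 448.0 E4M3 maximum, and the epsilon guard are assumptions, not FBGEMM's actual implementation.

```python
from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn


def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row maximum magnitude decides how much each row must be shrunk.
    row_amax = x.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Cap the dynamic range, mirroring the test's optional scale_ub tensor.
        row_amax = torch.clamp(row_amax, max=scale_ub.item())
    row_amax = torch.clamp(row_amax, min=1e-12)  # avoid division by zero
    scale = row_amax / E4M3_MAX  # per-row dequantization factor
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale


# Round trip mirrors how the test dequantizes: y_fp8.to(torch.float32) * y_scale[:, None]
y = torch.randn(4, 16)
y_fp8, y_scale = quantize_fp8_row_sketch(y)
y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]
```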
2025-05-07T20:31:37.9146930Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:38.3562518Z W0507 20:31:38.352000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical to the [1/4] warning above, ending in the same CompilationError for _fbgemm_silu_mul_quant; emitted four times for this example, at 20:31:38.352, 20:31:38.515, 20:31:38.962, and 20:31:38.992]
2025-05-07T20:31:39.4443956Z self = 
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

[test source and intermediate frames identical to the T = 1 failure above; only the object addresses differ]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
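Every example hypothesis tries (T=1, T=2048, T=128, ...) fails identically, because the failure is a property of the machine rather than of the inputs. One conventional way to keep such a suite green on pre-sm_89 runners, offered here only as a sketch and not as what this workflow does, is to skip FP8-dependent tests when the device lacks the type:

```python
import pytest
import torch


def require_fp8_gpu() -> None:
    """Skip (rather than fail) when Triton's fp8e4nv / torch.float8_e4m3fn is
    unavailable. The (8, 9) threshold is an assumption; see the error above."""
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    if torch.cuda.get_device_capability() < (8, 9):
        pytest.skip("float8_e4m3fn (Triton fp8e4nv) needs compute capability >= 8.9")
```

Called at the top of test_silu_mul_quant, this would turn the repeated failures above into a single skip on this A10G runner.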
2025-05-07T20:31:39.4483350Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:39.9157020Z W0507 20:31:39.912000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical to the [1/4] warning above; emitted at 20:31:39.912 and 20:31:40.077; the final repeat at 20:31:40.527 ends:]
2025-05-07T20:31:40.5346155Z W0507 20:31:40.527000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.5605308Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:40.5606405Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:40.5607767Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:40.5609216Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:40.5610768Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:40.5612186Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:40.5613516Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:40.5614924Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.5616373Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:40.5617646Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:40.5618991Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:40.5620228Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:40.5621278Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:40.5622327Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:40.5623573Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:40.5625010Z W0507 20:31:40.557000 86695 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:40.5626147Z W0507 
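[All of these warnings reduce to one fact: Triton's fp8e4nv type (torch.float8_e4m3fn) is only implemented for GPUs with compute capability >= 8.9 (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is sm_86, where only fp8e4b15 and fp8e5 are available. A minimal guard a test could use to skip these cases on older GPUs -- a sketch only, with a helper name that is not part of moe/activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs an Ada (sm_89) or Hopper (sm_90) class GPU;
        # the A10G on a g5 instance reports capability (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class or method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
]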
2025-05-07T20:31:40.9685311Z self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd11b0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
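[For reference, the failing ref_fn path computes a SwiGLU-style activation, y = x0 * sigmoid(x0) * x1, and then quantizes each row of y to fp8 with one scale per row. Below is a rough eager-mode sketch of that row-wise quantization -- an illustration only, assuming torch.float8_e4m3fn is available (exactly what this runner lacks), and not the actual triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        FP8_MAX = 448.0  # max normal value of float8_e4m3fn
        # One scale per row, derived from the row's max magnitude.
        row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # dequantization scale, one per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This matches how the test consumes the result: y_fp8.to(torch.float32) * y_scale[:, None] recovers an approximation of y.]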
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[the identify_mutated_tensors warning and CompilationError traceback above were logged four more times during this example, at 20:31:41.442, 20:31:41.605, 20:31:42.053, and 20:31:42.083, now tagged [1/7]]
2025-05-07T20:31:42.4952318Z self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row stack as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:42.5401587Z W0507 20:31:42.538000 86695 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
    To log all recompilation reasons, use TORCH_LOGS="recompiles".
    To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
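[The recompile-limit warning is a side effect of the property-based sweep itself: every new value of T, and every contiguous/strided layout change (the "stride mismatch" reason above), makes torch._dynamo recompile silu_mul_quant until the default limit of 8 is hit. Two ways this could be quieted, sketched with the torch._dynamo knobs the warning itself points at rather than anything taken from the test file:

    import torch

    # Option 1: raise the limit for a sweep that legitimately sees many shapes.
    torch._dynamo.config.recompile_limit = 32

    # Option 2: mark dim 0 (T) of the inputs as dynamic before compiling, so
    # size changes alone no longer force a recompile (x0/x1 are the test's
    # local input tensors).
    # torch._dynamo.mark_dynamic(x0, 0)
    # torch._dynamo.mark_dynamic(x1, 0)
]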
2025-05-07T20:31:42.6428224Z self =
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row stack as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:31:42.9907317Z self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the T = 128 example above]

[unlike the previous examples, this one fails inside fn() itself: the compiled silu_mul_quant launches the _fbgemm_silu_mul_quant kernel directly]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[make_ir locals omitted; same shape as the frame shown above, with num_stages=3]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:31:43.0623721Z self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the T = 128 example above]

>       y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:31:43.0640688Z 2025-05-07T20:31:43.0640794Z moe/activation_test.py:126: 2025-05-07T20:31:43.0641110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0641460Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:43.0641798Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:43.0642609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:43.0643382Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:43.0643941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0644645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0645444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:43.0646190Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.0646961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:43.0647729Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:43.0648477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:43.0649137Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:43.0649747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:43.0650283Z fn() 2025-05-07T20:31:43.0650816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:43.0651417Z self.fn.run( 2025-05-07T20:31:43.0651894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0652448Z kernel = self.compile( 2025-05-07T20:31:43.0653007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0653674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0654081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0654310Z 2025-05-07T20:31:43.0654553Z self = 2025-05-07T20:31:43.0656125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0657568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc51510fac0>} 2025-05-07T20:31:43.0659175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0660240Z context = 2025-05-07T20:31:43.0660536Z 2025-05-07T20:31:43.0660713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0661241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0661724Z module_map=module_map) 2025-05-07T20:31:43.0662100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0662471Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:43.0662749Z E ^ 2025-05-07T20:31:43.0663225Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0663688Z 2025-05-07T20:31:43.0664117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0664637Z 2025-05-07T20:31:43.0664751Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0665170Z self=, 2025-05-07T20:31:43.0665583Z T=1, 2025-05-07T20:31:43.0665779Z D=5120, 2025-05-07T20:31:43.0665976Z scale_ub=None, 2025-05-07T20:31:43.0666205Z contiguous=True, 2025-05-07T20:31:43.0666441Z compiled=False, 2025-05-07T20:31:43.0666653Z ) 2025-05-07T20:31:43.2319263Z self = 2025-05-07T20:31:43.2319977Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:43.2320599Z 2025-05-07T20:31:43.2320685Z @given( 2025-05-07T20:31:43.2320930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2321259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2321579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2321923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2322265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2322553Z ) 2025-05-07T20:31:43.2322913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2323364Z def test_silu_mul_quant( 2025-05-07T20:31:43.2323614Z self, 2025-05-07T20:31:43.2323818Z T: int, 2025-05-07T20:31:43.2324027Z D: int, 2025-05-07T20:31:43.2324250Z scale_ub: Optional[float], 2025-05-07T20:31:43.2324534Z contiguous: bool, 2025-05-07T20:31:43.2324781Z compiled: bool, 2025-05-07T20:31:43.2325009Z ) -> None: 2025-05-07T20:31:43.2325241Z torch.manual_seed(2025) 2025-05-07T20:31:43.2325495Z 2025-05-07T20:31:43.2325770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2326125Z 2025-05-07T20:31:43.2326328Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2326628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2326939Z x = x_sign * x_clamp 2025-05-07T20:31:43.2327186Z x0 = x[:, :D] 2025-05-07T20:31:43.2327408Z x1 = x[:, D:] 2025-05-07T20:31:43.2327617Z 2025-05-07T20:31:43.2327811Z if contiguous: 2025-05-07T20:31:43.2328048Z x0 = x0.contiguous() 2025-05-07T20:31:43.2328305Z x1 = x1.contiguous() 2025-05-07T20:31:43.2328558Z 2025-05-07T20:31:43.2328758Z if scale_ub is not None: 2025-05-07T20:31:43.2329034Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2329380Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2329701Z ) 2025-05-07T20:31:43.2329895Z else: 2025-05-07T20:31:43.2330114Z scale_ub_tensor = None 2025-05-07T20:31:43.2330376Z 2025-05-07T20:31:43.2330773Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2331094Z op = silu_mul_quant 2025-05-07T20:31:43.2331353Z if compiled: 2025-05-07T20:31:43.2331603Z 
op = torch.compile(op) 2025-05-07T20:31:43.2331918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2332203Z 2025-05-07T20:31:43.2332408Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2332576Z 2025-05-07T20:31:43.2332680Z moe/activation_test.py:117: 2025-05-07T20:31:43.2332985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2333321Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2333610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2334326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2335084Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2335641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2336332Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2337011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2337554Z kernel = self.compile( 2025-05-07T20:31:43.2338189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2338863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2339267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2339497Z 2025-05-07T20:31:43.2339799Z self = 2025-05-07T20:31:43.2340896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2342311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fdb3760>} 2025-05-07T20:31:43.2343683Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2344766Z context = 2025-05-07T20:31:43.2345078Z 2025-05-07T20:31:43.2345262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2345794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2346273Z module_map=module_map) 2025-05-07T20:31:43.2346654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2347015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2347290Z E ^ 2025-05-07T20:31:43.2347773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2348231Z 2025-05-07T20:31:43.2348662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2349180Z 2025-05-07T20:31:43.2349287Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2349717Z self=, 2025-05-07T20:31:43.2350123Z T=128, 2025-05-07T20:31:43.2350315Z D=5120, 2025-05-07T20:31:43.2350521Z scale_ub=None, 2025-05-07T20:31:43.2350743Z contiguous=False, 2025-05-07T20:31:43.2350972Z compiled=True, 2025-05-07T20:31:43.2351185Z ) 2025-05-07T20:31:43.2351520Z self = 2025-05-07T20:31:43.2352152Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.2352424Z 2025-05-07T20:31:43.2352503Z @given( 2025-05-07T20:31:43.2352743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.2353064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.2353373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.2353715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.2354063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.2354351Z ) 2025-05-07T20:31:43.2354714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.2355165Z def test_silu_mul_quant( 2025-05-07T20:31:43.2355419Z self, 2025-05-07T20:31:43.2356044Z T: int, 2025-05-07T20:31:43.2356251Z D: int, 2025-05-07T20:31:43.2356492Z scale_ub: Optional[float], 2025-05-07T20:31:43.2356780Z contiguous: bool, 2025-05-07T20:31:43.2357032Z compiled: bool, 2025-05-07T20:31:43.2357265Z ) -> None: 2025-05-07T20:31:43.2357485Z torch.manual_seed(2025) 2025-05-07T20:31:43.2366814Z 2025-05-07T20:31:43.2367144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.2367499Z 2025-05-07T20:31:43.2367705Z x_sign = torch.sign(x) 2025-05-07T20:31:43.2368000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.2368321Z x = x_sign * x_clamp 2025-05-07T20:31:43.2368571Z x0 = x[:, :D] 2025-05-07T20:31:43.2368785Z x1 = x[:, D:] 2025-05-07T20:31:43.2368998Z 2025-05-07T20:31:43.2369192Z if contiguous: 2025-05-07T20:31:43.2369426Z x0 = x0.contiguous() 2025-05-07T20:31:43.2369879Z x1 = x1.contiguous() 2025-05-07T20:31:43.2370135Z 2025-05-07T20:31:43.2370325Z if scale_ub is not None: 2025-05-07T20:31:43.2370613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.2370960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.2371280Z ) 2025-05-07T20:31:43.2371469Z else: 2025-05-07T20:31:43.2371682Z scale_ub_tensor = None 2025-05-07T20:31:43.2371939Z 2025-05-07T20:31:43.2372170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.2372490Z op = silu_mul_quant 2025-05-07T20:31:43.2372750Z if compiled: 2025-05-07T20:31:43.2372998Z op = torch.compile(op) 2025-05-07T20:31:43.2373306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2373585Z 2025-05-07T20:31:43.2373778Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.2373957Z 2025-05-07T20:31:43.2374059Z moe/activation_test.py:117: 2025-05-07T20:31:43.2374370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2374758Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.2375049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.2375628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.2376205Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.2376869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2377576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2378220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2378922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2379598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2380142Z kernel = self.compile( 2025-05-07T20:31:43.2380701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2381507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2381916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2382152Z 2025-05-07T20:31:43.2382365Z self = 2025-05-07T20:31:43.2383470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2384870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc51444ee60>} 2025-05-07T20:31:43.2386243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2387297Z context = 2025-05-07T20:31:43.2387591Z 2025-05-07T20:31:43.2387770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2388310Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2388785Z module_map=module_map) 2025-05-07T20:31:43.2389155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2389511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2389772Z E ^ 2025-05-07T20:31:43.2390325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.2390785Z 2025-05-07T20:31:43.2391208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.2391741Z 2025-05-07T20:31:43.2391848Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.2392271Z self=, 2025-05-07T20:31:43.2392679Z T=128, 2025-05-07T20:31:43.2392865Z D=7168, 2025-05-07T20:31:43.2393062Z scale_ub=1200.0, 2025-05-07T20:31:43.2393292Z contiguous=False, 2025-05-07T20:31:43.2393517Z compiled=False, 2025-05-07T20:31:43.2393727Z ) 2025-05-07T20:31:43.3657680Z self = 2025-05-07T20:31:43.3658498Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.3658835Z 2025-05-07T20:31:43.3658926Z @given( 2025-05-07T20:31:43.3659188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.3659516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.3659833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.3660179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.3660522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.3660818Z ) 2025-05-07T20:31:43.3661176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.3661633Z def test_silu_mul_quant( 2025-05-07T20:31:43.3661884Z self, 2025-05-07T20:31:43.3662082Z T: int, 2025-05-07T20:31:43.3662290Z D: int, 2025-05-07T20:31:43.3662519Z scale_ub: Optional[float], 2025-05-07T20:31:43.3662805Z contiguous: bool, 2025-05-07T20:31:43.3663052Z compiled: bool, 2025-05-07T20:31:43.3663287Z ) -> None: 2025-05-07T20:31:43.3663513Z torch.manual_seed(2025) 2025-05-07T20:31:43.3663764Z 2025-05-07T20:31:43.3664062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.3664424Z 2025-05-07T20:31:43.3664647Z x_sign = torch.sign(x) 2025-05-07T20:31:43.3665271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.3665590Z x = x_sign * x_clamp 2025-05-07T20:31:43.3665843Z x0 = x[:, :D] 2025-05-07T20:31:43.3666074Z x1 = x[:, D:] 2025-05-07T20:31:43.3666291Z 2025-05-07T20:31:43.3666485Z if contiguous: 2025-05-07T20:31:43.3666730Z x0 = x0.contiguous() 2025-05-07T20:31:43.3667003Z x1 = x1.contiguous() 2025-05-07T20:31:43.3667251Z 2025-05-07T20:31:43.3667454Z if scale_ub is not None: 2025-05-07T20:31:43.3667744Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.3668088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.3668411Z ) 2025-05-07T20:31:43.3668615Z else: 2025-05-07T20:31:43.3668839Z scale_ub_tensor = None 2025-05-07T20:31:43.3669102Z 2025-05-07T20:31:43.3669354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.3669682Z op = silu_mul_quant 2025-05-07T20:31:43.3669946Z if compiled: 2025-05-07T20:31:43.3670207Z op = torch.compile(op) 2025-05-07T20:31:43.3670521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3670798Z 2025-05-07T20:31:43.3671003Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.3671178Z 2025-05-07T20:31:43.3671292Z moe/activation_test.py:117: 2025-05-07T20:31:43.3671594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3671933Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.3672229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3672942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.3673647Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.3674347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.3675112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.3675791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.3676340Z kernel = self.compile( 2025-05-07T20:31:43.3676903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.3677580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.3677986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3678222Z 2025-05-07T20:31:43.3678438Z self = 2025-05-07T20:31:43.3679550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.3680972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc51444dab0>} 2025-05-07T20:31:43.3682339Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.3683390Z context = 2025-05-07T20:31:43.3683693Z 2025-05-07T20:31:43.3683867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.3684408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.3684895Z module_map=module_map) 2025-05-07T20:31:43.3685276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.3685649Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.3686079Z E ^ 2025-05-07T20:31:43.3686555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.3687022Z 2025-05-07T20:31:43.3687449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.3687975Z 2025-05-07T20:31:43.3688093Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.3688527Z self=, 2025-05-07T20:31:43.3688940Z T=128, 2025-05-07T20:31:43.3689144Z D=5120, 2025-05-07T20:31:43.3689350Z scale_ub=None, 2025-05-07T20:31:43.3689575Z contiguous=False, 2025-05-07T20:31:43.3689817Z compiled=False, 2025-05-07T20:31:43.3690041Z ) 2025-05-07T20:31:43.3690376Z self = 2025-05-07T20:31:43.3690890Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:43.3691173Z 2025-05-07T20:31:43.3691261Z @given( 2025-05-07T20:31:43.3691498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.3691828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.3692152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.3692496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.3692837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.3693139Z ) 2025-05-07T20:31:43.3693508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.3693958Z def test_silu_mul_quant( 2025-05-07T20:31:43.3694215Z self, 2025-05-07T20:31:43.3694427Z T: int, 2025-05-07T20:31:43.3694656Z D: int, 2025-05-07T20:31:43.3694996Z scale_ub: Optional[float], 2025-05-07T20:31:43.3695284Z contiguous: bool, 2025-05-07T20:31:43.3695531Z compiled: bool, 2025-05-07T20:31:43.3695776Z ) -> None: 2025-05-07T20:31:43.3696005Z torch.manual_seed(2025) 2025-05-07T20:31:43.3696253Z 2025-05-07T20:31:43.3696539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.3696893Z 2025-05-07T20:31:43.3697096Z x_sign = torch.sign(x) 2025-05-07T20:31:43.3697408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.3697734Z x = x_sign * x_clamp 2025-05-07T20:31:43.3697989Z x0 = x[:, :D] 2025-05-07T20:31:43.3698311Z x1 = x[:, D:] 2025-05-07T20:31:43.3698532Z 2025-05-07T20:31:43.3698732Z if contiguous: 2025-05-07T20:31:43.3698985Z x0 = x0.contiguous() 2025-05-07T20:31:43.3699260Z x1 = x1.contiguous() 2025-05-07T20:31:43.3699507Z 2025-05-07T20:31:43.3699726Z if scale_ub is not None: 2025-05-07T20:31:43.3700017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.3700373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.3700696Z ) 2025-05-07T20:31:43.3700903Z else: 2025-05-07T20:31:43.3701127Z scale_ub_tensor = None 2025-05-07T20:31:43.3701391Z 2025-05-07T20:31:43.3701640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.3701969Z op = silu_mul_quant 2025-05-07T20:31:43.3702230Z if compiled: 2025-05-07T20:31:43.3702502Z op = torch.compile(op) 2025-05-07T20:31:43.3702814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3703095Z 2025-05-07T20:31:43.3703307Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.3703480Z 2025-05-07T20:31:43.3703590Z moe/activation_test.py:117: 2025-05-07T20:31:43.3703891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3704233Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.3704531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.3705298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.3706126Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.3706683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.3707385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.3708061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.3708610Z kernel = self.compile( 2025-05-07T20:31:43.3709170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.3709853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.3710255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.3710494Z 2025-05-07T20:31:43.3710715Z self = 2025-05-07T20:31:43.3711824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.3713236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd2950>} 2025-05-07T20:31:43.3714611Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.3715737Z context = 2025-05-07T20:31:43.3716043Z 2025-05-07T20:31:43.3716221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.3716767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.3717247Z module_map=module_map) 2025-05-07T20:31:43.3717626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.3717994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.3718267Z E ^ 2025-05-07T20:31:43.3718743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.3719210Z 2025-05-07T20:31:43.3719637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.3720159Z 2025-05-07T20:31:43.3720279Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.3720720Z self=, 2025-05-07T20:31:43.3721131Z T=128, 2025-05-07T20:31:43.3721331Z D=5120, 2025-05-07T20:31:43.3721544Z scale_ub=1200.0, 2025-05-07T20:31:43.3721778Z contiguous=True, 2025-05-07T20:31:43.3722014Z compiled=False, 2025-05-07T20:31:43.3722233Z ) 2025-05-07T20:31:43.5659586Z self = 2025-05-07T20:31:43.5660213Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:43.5660611Z 2025-05-07T20:31:43.5660698Z @given( 2025-05-07T20:31:43.5660935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5661253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5661572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5661917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5662360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5662679Z ) 2025-05-07T20:31:43.5663042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5663843Z def test_silu_mul_quant( 2025-05-07T20:31:43.5664094Z self, 2025-05-07T20:31:43.5664296Z T: int, 2025-05-07T20:31:43.5664500Z D: int, 2025-05-07T20:31:43.5664721Z scale_ub: Optional[float], 2025-05-07T20:31:43.5664999Z contiguous: bool, 2025-05-07T20:31:43.5665245Z compiled: bool, 2025-05-07T20:31:43.5665517Z ) -> None: 2025-05-07T20:31:43.5665738Z torch.manual_seed(2025) 2025-05-07T20:31:43.5665986Z 2025-05-07T20:31:43.5666270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5666622Z 2025-05-07T20:31:43.5666816Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5667118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5667437Z x = x_sign * x_clamp 2025-05-07T20:31:43.5667686Z x0 = x[:, :D] 2025-05-07T20:31:43.5667908Z x1 = x[:, D:] 2025-05-07T20:31:43.5668123Z 2025-05-07T20:31:43.5668309Z if contiguous: 2025-05-07T20:31:43.5668558Z x0 = x0.contiguous() 2025-05-07T20:31:43.5668823Z x1 = x1.contiguous() 2025-05-07T20:31:43.5669064Z 2025-05-07T20:31:43.5669263Z if scale_ub is not None: 2025-05-07T20:31:43.5669546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5669887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5670202Z ) 2025-05-07T20:31:43.5670403Z else: 2025-05-07T20:31:43.5670618Z scale_ub_tensor = None 2025-05-07T20:31:43.5670878Z 2025-05-07T20:31:43.5671120Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5671447Z op = silu_mul_quant 2025-05-07T20:31:43.5671707Z if compiled: 2025-05-07T20:31:43.5671974Z op = torch.compile(op) 2025-05-07T20:31:43.5672434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5672717Z 2025-05-07T20:31:43.5672923Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5673096Z 2025-05-07T20:31:43.5673212Z moe/activation_test.py:117: 2025-05-07T20:31:43.5673511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5673852Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5674153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5674913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5675623Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5676177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5676878Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5677561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5678109Z kernel = self.compile( 2025-05-07T20:31:43.5678678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5679359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5679761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5680000Z 2025-05-07T20:31:43.5680213Z self = 2025-05-07T20:31:43.5681317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5682739Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cc3a0>} 2025-05-07T20:31:43.5684107Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5685297Z context = 2025-05-07T20:31:43.5685601Z 2025-05-07T20:31:43.5685773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5686312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5686789Z module_map=module_map) 2025-05-07T20:31:43.5687164Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5687530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5687792Z E ^ 2025-05-07T20:31:43.5688275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5688741Z 2025-05-07T20:31:43.5689167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5689693Z 2025-05-07T20:31:43.5689810Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5690236Z self=, 2025-05-07T20:31:43.5690647Z T=1, 2025-05-07T20:31:43.5690841Z D=7168, 2025-05-07T20:31:43.5691038Z scale_ub=1200.0, 2025-05-07T20:31:43.5691277Z contiguous=True, 2025-05-07T20:31:43.5691509Z compiled=True, 2025-05-07T20:31:43.5691720Z ) 2025-05-07T20:31:43.5692050Z self = 2025-05-07T20:31:43.5692547Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.5692812Z 2025-05-07T20:31:43.5692900Z @given( 2025-05-07T20:31:43.5693220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5693545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5693866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5694207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5694550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5694843Z ) 2025-05-07T20:31:43.5695201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5695659Z def test_silu_mul_quant( 2025-05-07T20:31:43.5695914Z self, 2025-05-07T20:31:43.5696122Z T: int, 2025-05-07T20:31:43.5696331Z D: int, 2025-05-07T20:31:43.5696565Z scale_ub: Optional[float], 2025-05-07T20:31:43.5696853Z contiguous: bool, 2025-05-07T20:31:43.5697102Z compiled: bool, 2025-05-07T20:31:43.5697343Z ) -> None: 2025-05-07T20:31:43.5697573Z torch.manual_seed(2025) 2025-05-07T20:31:43.5697820Z 2025-05-07T20:31:43.5698252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5698610Z 2025-05-07T20:31:43.5698813Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5699127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5699451Z x = x_sign * x_clamp 2025-05-07T20:31:43.5699703Z x0 = x[:, :D] 2025-05-07T20:31:43.5699937Z x1 = x[:, D:] 2025-05-07T20:31:43.5700158Z 2025-05-07T20:31:43.5700352Z if contiguous: 2025-05-07T20:31:43.5700598Z x0 = x0.contiguous() 2025-05-07T20:31:43.5700868Z x1 = x1.contiguous() 2025-05-07T20:31:43.5701121Z 2025-05-07T20:31:43.5701321Z if scale_ub is not None: 2025-05-07T20:31:43.5701612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5701959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5702272Z ) 2025-05-07T20:31:43.5702477Z else: 2025-05-07T20:31:43.5702700Z scale_ub_tensor = None 2025-05-07T20:31:43.5702968Z 2025-05-07T20:31:43.5703213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5703541Z op = silu_mul_quant 2025-05-07T20:31:43.5703933Z if compiled: 2025-05-07T20:31:43.5704195Z op = torch.compile(op) 2025-05-07T20:31:43.5704507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5704827Z 2025-05-07T20:31:43.5705046Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5705216Z 2025-05-07T20:31:43.5705326Z moe/activation_test.py:117: 2025-05-07T20:31:43.5705632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5705964Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5706262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5706837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5707407Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5708087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5708800Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5709356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5710045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5710736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5711292Z kernel = self.compile( 2025-05-07T20:31:43.5711847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5712524Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5712931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5713267Z 2025-05-07T20:31:43.5713492Z self = 2025-05-07T20:31:43.5714592Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5716042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cedd0>} 2025-05-07T20:31:43.5728081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5729139Z context = 2025-05-07T20:31:43.5729436Z 2025-05-07T20:31:43.5729626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5730157Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5730644Z module_map=module_map) 2025-05-07T20:31:43.5731023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5731391Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5731653Z E ^ 2025-05-07T20:31:43.5732135Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5732597Z 2025-05-07T20:31:43.5733029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5733550Z 2025-05-07T20:31:43.5733658Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5734082Z self=, 2025-05-07T20:31:43.5734501Z T=1, 2025-05-07T20:31:43.5734727Z D=7168, 2025-05-07T20:31:43.5734949Z scale_ub=1200.0, 2025-05-07T20:31:43.5735184Z contiguous=False, 2025-05-07T20:31:43.5735420Z compiled=True, 2025-05-07T20:31:43.5735746Z ) 2025-05-07T20:31:43.9106201Z self = 2025-05-07T20:31:43.9106780Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.9107054Z 2025-05-07T20:31:43.9107145Z @given( 2025-05-07T20:31:43.9107383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.9107707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.9108032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.9108371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.9108714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.9109012Z ) 2025-05-07T20:31:43.9109370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.9109848Z def test_silu_mul_quant( 2025-05-07T20:31:43.9110103Z self, 2025-05-07T20:31:43.9110310Z T: int, 2025-05-07T20:31:43.9110519Z D: int, 2025-05-07T20:31:43.9110749Z scale_ub: Optional[float], 2025-05-07T20:31:43.9111036Z contiguous: bool, 2025-05-07T20:31:43.9111280Z compiled: bool, 2025-05-07T20:31:43.9111522Z ) -> None: 2025-05-07T20:31:43.9111752Z torch.manual_seed(2025) 2025-05-07T20:31:43.9111998Z 2025-05-07T20:31:43.9112286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.9112643Z 2025-05-07T20:31:43.9112849Z x_sign = torch.sign(x) 2025-05-07T20:31:43.9113146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.9113465Z x = x_sign * x_clamp 2025-05-07T20:31:43.9113713Z x0 = x[:, :D] 2025-05-07T20:31:43.9113931Z x1 = x[:, D:] 2025-05-07T20:31:43.9114145Z 2025-05-07T20:31:43.9114844Z if contiguous: 2025-05-07T20:31:43.9115093Z x0 = x0.contiguous() 2025-05-07T20:31:43.9115361Z x1 = x1.contiguous() 2025-05-07T20:31:43.9115611Z 2025-05-07T20:31:43.9115812Z if scale_ub is not None: 2025-05-07T20:31:43.9116100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.9116451Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.9116761Z ) 2025-05-07T20:31:43.9116961Z else: 2025-05-07T20:31:43.9117180Z scale_ub_tensor = None 2025-05-07T20:31:43.9117443Z 2025-05-07T20:31:43.9117681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.9118003Z op = silu_mul_quant 2025-05-07T20:31:43.9118265Z if compiled: 2025-05-07T20:31:43.9118523Z op = torch.compile(op) 2025-05-07T20:31:43.9118840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9119131Z 2025-05-07T20:31:43.9119328Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.9119511Z 2025-05-07T20:31:43.9119616Z moe/activation_test.py:117: 2025-05-07T20:31:43.9119927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9120262Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.9120558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.9121136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.9121716Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.9122384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.9123089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.9123644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.9124335Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.9125071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.9125613Z kernel = self.compile( 2025-05-07T20:31:43.9126327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.9126994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.9127399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.9127627Z 2025-05-07T20:31:43.9127845Z self = 2025-05-07T20:31:43.9128946Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.9130355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144cc0d0>} 2025-05-07T20:31:43.9131724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.9132775Z context = 2025-05-07T20:31:43.9133068Z 2025-05-07T20:31:43.9133245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.9133775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.9134258Z module_map=module_map) 2025-05-07T20:31:43.9134645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.9135048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.9135326Z E ^ 2025-05-07T20:31:43.9135915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.9136375Z 2025-05-07T20:31:43.9136813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.9137332Z 2025-05-07T20:31:43.9137447Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.9137867Z self=, 2025-05-07T20:31:43.9138360Z T=1, 2025-05-07T20:31:43.9138552Z D=7168, 2025-05-07T20:31:43.9138745Z scale_ub=None, 2025-05-07T20:31:43.9138973Z contiguous=False, 2025-05-07T20:31:43.9139211Z compiled=True, 2025-05-07T20:31:43.9139419Z ) 2025-05-07T20:31:44.0098517Z self = 2025-05-07T20:31:44.0099057Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:44.0099325Z 2025-05-07T20:31:44.0099412Z @given( 2025-05-07T20:31:44.0099667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.0099991Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.0100319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.0100654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.0101000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.0101294Z ) 2025-05-07T20:31:44.0101648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.0102103Z def test_silu_mul_quant( 2025-05-07T20:31:44.0102355Z self, 2025-05-07T20:31:44.0102551Z T: int, 2025-05-07T20:31:44.0102756Z D: int, 2025-05-07T20:31:44.0102988Z scale_ub: Optional[float], 2025-05-07T20:31:44.0103291Z contiguous: bool, 2025-05-07T20:31:44.0103536Z compiled: bool, 2025-05-07T20:31:44.0103781Z ) -> None: 2025-05-07T20:31:44.0104007Z torch.manual_seed(2025) 2025-05-07T20:31:44.0104257Z 2025-05-07T20:31:44.0104541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.0104926Z 2025-05-07T20:31:44.0105485Z x_sign = torch.sign(x) 2025-05-07T20:31:44.0105778Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.0106094Z x = x_sign * x_clamp 2025-05-07T20:31:44.0106339Z x0 = x[:, :D] 2025-05-07T20:31:44.0106555Z x1 = x[:, D:] 2025-05-07T20:31:44.0106769Z 2025-05-07T20:31:44.0106962Z if contiguous: 2025-05-07T20:31:44.0107196Z x0 = x0.contiguous() 2025-05-07T20:31:44.0107460Z x1 = x1.contiguous() 2025-05-07T20:31:44.0107705Z 2025-05-07T20:31:44.0107900Z if scale_ub is not None: 2025-05-07T20:31:44.0108185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.0108532Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.0108842Z ) 2025-05-07T20:31:44.0109042Z else: 2025-05-07T20:31:44.0109265Z scale_ub_tensor = None 2025-05-07T20:31:44.0109520Z 2025-05-07T20:31:44.0109758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.0110084Z op = silu_mul_quant 2025-05-07T20:31:44.0110339Z if compiled: 2025-05-07T20:31:44.0110589Z op = torch.compile(op) 2025-05-07T20:31:44.0110898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.0111178Z 2025-05-07T20:31:44.0111376Z y_fp8, y_scale = fn() 2025-05-07T20:31:44.0111670Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:44.0111966Z 2025-05-07T20:31:44.0112208Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.0112551Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:44.0112852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:44.0113170Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:44.0113683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:44.0114008Z 2025-05-07T20:31:44.0114224Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:44.0114428Z 2025-05-07T20:31:44.0114536Z moe/activation_test.py:126: 2025-05-07T20:31:44.0114877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.0115232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:44.0115566Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:44.0116371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:44.0117138Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:44.0117697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.0118388Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.0119095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:44.0119837Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:44.0120604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:44.0121365Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:44.0122113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:44.0122765Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:44.0123372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:44.0123916Z fn() 2025-05-07T20:31:44.0124446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:44.0125077Z self.fn.run( 2025-05-07T20:31:44.0125574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.0126242Z kernel = self.compile( 2025-05-07T20:31:44.0126801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.0127474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.0127872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.0128109Z 2025-05-07T20:31:44.0128321Z self = 2025-05-07T20:31:44.0129433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.0130851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc5144ce440>} 2025-05-07T20:31:44.0132219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.0133274Z context = 2025-05-07T20:31:44.0133576Z 2025-05-07T20:31:44.0133748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.0134287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.0134787Z module_map=module_map) 2025-05-07T20:31:44.0135192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.0135643Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:44.0135925Z E ^ 2025-05-07T20:31:44.0136400Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.0136868Z 2025-05-07T20:31:44.0137293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.0137811Z 2025-05-07T20:31:44.0137926Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.0138450Z self=, 2025-05-07T20:31:44.0138855Z T=1, 2025-05-07T20:31:44.0139048Z D=5120, 2025-05-07T20:31:44.0139251Z scale_ub=1200.0, 2025-05-07T20:31:44.0139481Z contiguous=False, 2025-05-07T20:31:44.0139720Z compiled=True, 2025-05-07T20:31:44.0139939Z ) 2025-05-07T20:31:44.1823639Z self = 2025-05-07T20:31:44.1824219Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.1824498Z 2025-05-07T20:31:44.1824579Z @given( 2025-05-07T20:31:44.1824824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.1825151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.1825470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.1825817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.1826157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.1826443Z ) 2025-05-07T20:31:44.1826802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.1827254Z def test_silu_mul_quant( 2025-05-07T20:31:44.1827497Z self, 2025-05-07T20:31:44.1827700Z T: int, 2025-05-07T20:31:44.1827904Z D: int, 2025-05-07T20:31:44.1828130Z scale_ub: Optional[float], 2025-05-07T20:31:44.1828417Z contiguous: bool, 2025-05-07T20:31:44.1828666Z compiled: bool, 2025-05-07T20:31:44.1828901Z ) -> None: 2025-05-07T20:31:44.1829130Z torch.manual_seed(2025) 2025-05-07T20:31:44.1829384Z 2025-05-07T20:31:44.1829992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.1830354Z 2025-05-07T20:31:44.1830563Z x_sign = torch.sign(x) 2025-05-07T20:31:44.1830871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.1831184Z x = x_sign * x_clamp 2025-05-07T20:31:44.1831434Z x0 = x[:, :D] 2025-05-07T20:31:44.1831658Z x1 = x[:, D:] 2025-05-07T20:31:44.1831873Z 2025-05-07T20:31:44.1832070Z if contiguous: 2025-05-07T20:31:44.1832310Z x0 = x0.contiguous() 2025-05-07T20:31:44.1832572Z x1 = x1.contiguous() 2025-05-07T20:31:44.1832827Z 2025-05-07T20:31:44.1833031Z if scale_ub is not None: 2025-05-07T20:31:44.1833310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.1833661Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.1833984Z ) 2025-05-07T20:31:44.1834185Z else: 2025-05-07T20:31:44.1834405Z scale_ub_tensor = None 2025-05-07T20:31:44.1834685Z 2025-05-07T20:31:44.1834961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.1835283Z op = silu_mul_quant 2025-05-07T20:31:44.1835539Z if compiled: 
2025-05-07T20:31:44.1835788Z op = torch.compile(op) 2025-05-07T20:31:44.1836099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.1836378Z 2025-05-07T20:31:44.1836579Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.1836751Z 2025-05-07T20:31:44.1836857Z moe/activation_test.py:117: 2025-05-07T20:31:44.1837158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.1837492Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.1837779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.1838506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.1839084Z return fn(*args, **kwargs) 2025-05-07T20:31:44.1839764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.1840461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.1841010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.1841709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.1842379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.1842921Z kernel = self.compile( 2025-05-07T20:31:44.1843477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.1844155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.1844552Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.1844792Z 2025-05-07T20:31:44.1845031Z self = 2025-05-07T20:31:44.1846153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.1847566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca5f0>} 2025-05-07T20:31:44.1848933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.1849983Z context = 2025-05-07T20:31:44.1850286Z 2025-05-07T20:31:44.1850457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.1851078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.1851554Z module_map=module_map) 2025-05-07T20:31:44.1851930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.1852294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.1852567Z E ^ 2025-05-07T20:31:44.1853038Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:44.1853506Z
2025-05-07T20:31:44.1853931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:44.1854452Z
2025-05-07T20:31:44.1854574Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:44.1886772Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:44.2932558Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:44.4250611Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:44.4285223Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:44.4316798Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:44.6310375Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:44.9980593Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.0014983Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.1329313Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.1361600Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
[Each of the eleven examples above failed identically: the same CompilationError raised from _fbgemm_silu_mul_quant — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 — with test source and tracebacks verbatim duplicates of the one shown above; the repeats are omitted here.]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1398823Z 2025-05-07T20:31:45.1399245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1399764Z 2025-05-07T20:31:45.1399875Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1400290Z self=, 2025-05-07T20:31:45.1400703Z T=16384, 2025-05-07T20:31:45.1400903Z D=7168, 2025-05-07T20:31:45.1401106Z scale_ub=1200.0, 2025-05-07T20:31:45.1401335Z contiguous=False, 2025-05-07T20:31:45.1401577Z compiled=True, 2025-05-07T20:31:45.4041047Z ) 2025-05-07T20:31:45.4042023Z self = 2025-05-07T20:31:45.4043265Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4043673Z 2025-05-07T20:31:45.4043791Z @given( 2025-05-07T20:31:45.4044115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4044454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4044773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4045163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4045518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4045817Z ) 2025-05-07T20:31:45.4046181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4046646Z def test_silu_mul_quant( 2025-05-07T20:31:45.4046903Z self, 2025-05-07T20:31:45.4047119Z T: int, 2025-05-07T20:31:45.4047334Z D: int, 2025-05-07T20:31:45.4047595Z scale_ub: Optional[float], 2025-05-07T20:31:45.4047887Z contiguous: bool, 2025-05-07T20:31:45.4048148Z compiled: bool, 2025-05-07T20:31:45.4048382Z ) -> None: 2025-05-07T20:31:45.4048615Z torch.manual_seed(2025) 2025-05-07T20:31:45.4048867Z 2025-05-07T20:31:45.4049151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4049508Z 2025-05-07T20:31:45.4049720Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4050023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4050363Z x = x_sign * x_clamp 2025-05-07T20:31:45.4050627Z x0 = x[:, :D] 2025-05-07T20:31:45.4050853Z x1 = x[:, D:] 2025-05-07T20:31:45.4051082Z 2025-05-07T20:31:45.4051285Z if contiguous: 2025-05-07T20:31:45.4051522Z x0 = x0.contiguous() 2025-05-07T20:31:45.4051978Z x1 = x1.contiguous() 2025-05-07T20:31:45.4052234Z 2025-05-07T20:31:45.4052433Z if scale_ub is not None: 2025-05-07T20:31:45.4052720Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4053078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4053409Z ) 2025-05-07T20:31:45.4053611Z else: 2025-05-07T20:31:45.4053844Z scale_ub_tensor = None 2025-05-07T20:31:45.4054111Z 2025-05-07T20:31:45.4054350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4054680Z op = silu_mul_quant 2025-05-07T20:31:45.4054947Z if compiled: 2025-05-07T20:31:45.4055217Z op = torch.compile(op) 2025-05-07T20:31:45.4055880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4056175Z 2025-05-07T20:31:45.4056375Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4056554Z 2025-05-07T20:31:45.4056660Z moe/activation_test.py:117: 2025-05-07T20:31:45.4056977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4057309Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4057606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4058313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4058894Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4059565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4060274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4060830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4061529Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4062200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4062756Z kernel = self.compile( 2025-05-07T20:31:45.4063315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4064157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4064570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4064805Z 2025-05-07T20:31:45.4065020Z self = 2025-05-07T20:31:45.4066131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4067545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f2455a0>} 2025-05-07T20:31:45.4068923Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4069981Z context = 2025-05-07T20:31:45.4070273Z 2025-05-07T20:31:45.4070452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4070989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4071467Z module_map=module_map) 2025-05-07T20:31:45.4071848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4072220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4072482Z E ^ 2025-05-07T20:31:45.4072960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4073535Z 2025-05-07T20:31:45.4073972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4074500Z 2025-05-07T20:31:45.4074614Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4075054Z self=, 2025-05-07T20:31:45.4075499Z T=1, 2025-05-07T20:31:45.4075696Z D=7168, 2025-05-07T20:31:45.4075891Z scale_ub=None, 2025-05-07T20:31:45.4076117Z contiguous=False, 2025-05-07T20:31:45.4076354Z compiled=False, 2025-05-07T20:31:45.4076565Z ) 2025-05-07T20:31:45.4076898Z self = 2025-05-07T20:31:45.4077398Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4077664Z 2025-05-07T20:31:45.4077752Z @given( 2025-05-07T20:31:45.4077983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4078314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4078630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4078966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4079314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4079610Z ) 2025-05-07T20:31:45.4079966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4080424Z def test_silu_mul_quant( 2025-05-07T20:31:45.4080676Z self, 2025-05-07T20:31:45.4080872Z T: int, 2025-05-07T20:31:45.4081075Z D: int, 2025-05-07T20:31:45.4081299Z scale_ub: Optional[float], 2025-05-07T20:31:45.4081571Z contiguous: bool, 2025-05-07T20:31:45.4081820Z compiled: bool, 2025-05-07T20:31:45.4082067Z ) -> None: 2025-05-07T20:31:45.4082296Z torch.manual_seed(2025) 2025-05-07T20:31:45.4082550Z 2025-05-07T20:31:45.4082829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4083189Z 2025-05-07T20:31:45.4083395Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4083691Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4084101Z x = x_sign * x_clamp 2025-05-07T20:31:45.4084350Z x0 = x[:, :D] 2025-05-07T20:31:45.4084573Z x1 = x[:, D:] 2025-05-07T20:31:45.4084793Z 2025-05-07T20:31:45.4084996Z if contiguous: 2025-05-07T20:31:45.4085231Z x0 = x0.contiguous() 2025-05-07T20:31:45.4085548Z x1 = x1.contiguous() 2025-05-07T20:31:45.4085799Z 2025-05-07T20:31:45.4085996Z if scale_ub is not None: 2025-05-07T20:31:45.4086284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4086633Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4086954Z ) 2025-05-07T20:31:45.4087149Z else: 2025-05-07T20:31:45.4087370Z scale_ub_tensor = None 2025-05-07T20:31:45.4087633Z 2025-05-07T20:31:45.4087873Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4088198Z op = silu_mul_quant 2025-05-07T20:31:45.4088454Z if compiled: 2025-05-07T20:31:45.4088713Z op = torch.compile(op) 2025-05-07T20:31:45.4089019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4089302Z 2025-05-07T20:31:45.4089501Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4089679Z 2025-05-07T20:31:45.4089784Z moe/activation_test.py:117: 2025-05-07T20:31:45.4090087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4090426Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4090711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4091416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4092121Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4092749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4093449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4094132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4094682Z kernel = self.compile( 2025-05-07T20:31:45.4095231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4095902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4096308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4096535Z 2025-05-07T20:31:45.4096746Z self = 2025-05-07T20:31:45.4097851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4099322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f245d80>} 2025-05-07T20:31:45.4100699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4101745Z context = 2025-05-07T20:31:45.4102038Z 2025-05-07T20:31:45.4102210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4102747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4103230Z module_map=module_map) 2025-05-07T20:31:45.4103612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4103973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4104243Z E ^ 2025-05-07T20:31:45.4104809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4105318Z 2025-05-07T20:31:45.4105743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4106270Z 2025-05-07T20:31:45.4106382Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4106810Z self=, 2025-05-07T20:31:45.4107222Z T=2048, 2025-05-07T20:31:45.4107415Z D=7168, 2025-05-07T20:31:45.4107619Z scale_ub=None, 2025-05-07T20:31:45.4107848Z contiguous=False, 2025-05-07T20:31:45.4108075Z compiled=True, 2025-05-07T20:31:45.4108285Z ) 2025-05-07T20:31:45.5115834Z self = 2025-05-07T20:31:45.5116636Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5117045Z 2025-05-07T20:31:45.5117160Z @given( 2025-05-07T20:31:45.5117496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5117842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5118164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5118511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5118850Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5119148Z ) 2025-05-07T20:31:45.5119516Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5119965Z def test_silu_mul_quant( 2025-05-07T20:31:45.5120224Z self, 2025-05-07T20:31:45.5120436Z T: int, 2025-05-07T20:31:45.5120639Z D: int, 2025-05-07T20:31:45.5120876Z scale_ub: Optional[float], 2025-05-07T20:31:45.5121835Z contiguous: bool, 2025-05-07T20:31:45.5122093Z compiled: bool, 2025-05-07T20:31:45.5122334Z ) -> None: 2025-05-07T20:31:45.5122573Z torch.manual_seed(2025) 2025-05-07T20:31:45.5122828Z 2025-05-07T20:31:45.5123105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5123461Z 2025-05-07T20:31:45.5123671Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5123972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5124299Z x = x_sign * x_clamp 2025-05-07T20:31:45.5124555Z x0 = x[:, :D] 2025-05-07T20:31:45.5124777Z x1 = x[:, D:] 2025-05-07T20:31:45.5125002Z 2025-05-07T20:31:45.5125224Z if contiguous: 2025-05-07T20:31:45.5125486Z x0 = x0.contiguous() 2025-05-07T20:31:45.5125759Z x1 = x1.contiguous() 2025-05-07T20:31:45.5126014Z 2025-05-07T20:31:45.5126215Z if scale_ub is not None: 2025-05-07T20:31:45.5126511Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5126869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5127187Z ) 2025-05-07T20:31:45.5127406Z else: 2025-05-07T20:31:45.5127644Z scale_ub_tensor = None 2025-05-07T20:31:45.5127909Z 2025-05-07T20:31:45.5128161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5128494Z op = silu_mul_quant 2025-05-07T20:31:45.5128761Z if compiled: 2025-05-07T20:31:45.5129015Z op = torch.compile(op) 2025-05-07T20:31:45.5129327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5129616Z 2025-05-07T20:31:45.5129818Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5129997Z 2025-05-07T20:31:45.5130102Z moe/activation_test.py:117: 2025-05-07T20:31:45.5130410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5130744Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5131047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5131643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5132392Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5133067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5133775Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5134330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5135019Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5135753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5136305Z kernel = self.compile( 2025-05-07T20:31:45.5136872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5137541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5137958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5138272Z 2025-05-07T20:31:45.5138493Z self = 2025-05-07T20:31:45.5139600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5141000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f246f80>} 2025-05-07T20:31:45.5142491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5143547Z context = 2025-05-07T20:31:45.5143848Z 2025-05-07T20:31:45.5144029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5144563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5145049Z module_map=module_map) 2025-05-07T20:31:45.5145432Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5145806Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5146072Z E ^ 2025-05-07T20:31:45.5146557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5147017Z 2025-05-07T20:31:45.5147460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5147981Z 2025-05-07T20:31:45.5148101Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.5148524Z self=, 2025-05-07T20:31:45.5148947Z T=4096, 2025-05-07T20:31:45.5149152Z D=7168, 2025-05-07T20:31:45.5149353Z scale_ub=None, 2025-05-07T20:31:45.5149585Z contiguous=False, 2025-05-07T20:31:45.5149826Z compiled=True, 2025-05-07T20:31:45.5150041Z ) 2025-05-07T20:31:45.5150376Z self = 2025-05-07T20:31:45.5150889Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.5151167Z 2025-05-07T20:31:45.5151246Z @given( 2025-05-07T20:31:45.5151490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.5151819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.5152142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.5152489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.5152833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.5153130Z ) 2025-05-07T20:31:45.5153574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.5154030Z def test_silu_mul_quant( 2025-05-07T20:31:45.5154284Z self, 2025-05-07T20:31:45.5154482Z T: int, 2025-05-07T20:31:45.5154690Z D: int, 2025-05-07T20:31:45.5154923Z scale_ub: Optional[float], 2025-05-07T20:31:45.5155230Z contiguous: bool, 2025-05-07T20:31:45.5155506Z compiled: bool, 2025-05-07T20:31:45.5156034Z ) -> None: 2025-05-07T20:31:45.5156254Z torch.manual_seed(2025) 2025-05-07T20:31:45.5156506Z 2025-05-07T20:31:45.5156794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.5157139Z 2025-05-07T20:31:45.5157349Z x_sign = torch.sign(x) 2025-05-07T20:31:45.5157657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.5157979Z x = x_sign * x_clamp 2025-05-07T20:31:45.5158223Z x0 = x[:, :D] 2025-05-07T20:31:45.5158457Z x1 = x[:, D:] 2025-05-07T20:31:45.5158677Z 2025-05-07T20:31:45.5158869Z if contiguous: 2025-05-07T20:31:45.5159113Z x0 = x0.contiguous() 2025-05-07T20:31:45.5159384Z x1 = x1.contiguous() 2025-05-07T20:31:45.5159627Z 2025-05-07T20:31:45.5159830Z if scale_ub is not None: 2025-05-07T20:31:45.5160114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.5160460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.5160780Z ) 2025-05-07T20:31:45.5160985Z else: 2025-05-07T20:31:45.5161202Z scale_ub_tensor = None 2025-05-07T20:31:45.5161468Z 2025-05-07T20:31:45.5161717Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.5162033Z op = silu_mul_quant 2025-05-07T20:31:45.5162430Z if compiled: 2025-05-07T20:31:45.5171378Z op = torch.compile(op) 2025-05-07T20:31:45.5171761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5172054Z 2025-05-07T20:31:45.5172261Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.5172434Z 2025-05-07T20:31:45.5172549Z moe/activation_test.py:117: 2025-05-07T20:31:45.5172852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5173200Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.5173499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.5174070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.5174650Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5175331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5176046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5176591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5177300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5177983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5178629Z kernel = self.compile( 2025-05-07T20:31:45.5179194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5179873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5180286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5180516Z 2025-05-07T20:31:45.5180732Z self = 2025-05-07T20:31:45.5181847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5183444Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f247e20>} 2025-05-07T20:31:45.5184819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5185922Z context = 2025-05-07T20:31:45.5186216Z 2025-05-07T20:31:45.5186388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5186927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5187418Z module_map=module_map) 2025-05-07T20:31:45.5187790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5188162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5188444Z E ^ 2025-05-07T20:31:45.5188928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5189388Z 2025-05-07T20:31:45.5189814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5190345Z 2025-05-07T20:31:45.8964868Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8965776Z self=, 2025-05-07T20:31:45.8966214Z T=16384, 2025-05-07T20:31:45.8966429Z D=5120, 2025-05-07T20:31:45.8966636Z scale_ub=1200.0, 2025-05-07T20:31:45.8966881Z contiguous=False, 2025-05-07T20:31:45.8967128Z compiled=False, 2025-05-07T20:31:45.8967668Z ) 2025-05-07T20:31:45.8968016Z self = 2025-05-07T20:31:45.8968544Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.8968849Z 2025-05-07T20:31:45.8968942Z @given( 2025-05-07T20:31:45.8969182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.8969513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.8969843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.8970184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.8970530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.8970831Z ) 2025-05-07T20:31:45.8971201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.8971651Z def test_silu_mul_quant( 2025-05-07T20:31:45.8971903Z self, 2025-05-07T20:31:45.8972114Z T: int, 2025-05-07T20:31:45.8972316Z D: int, 2025-05-07T20:31:45.8972559Z scale_ub: Optional[float], 2025-05-07T20:31:45.8972848Z contiguous: bool, 2025-05-07T20:31:45.8973098Z compiled: bool, 2025-05-07T20:31:45.8973348Z ) -> None: 2025-05-07T20:31:45.8973580Z torch.manual_seed(2025) 2025-05-07T20:31:45.8973828Z 2025-05-07T20:31:45.8974117Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.8974477Z 2025-05-07T20:31:45.8974680Z x_sign = torch.sign(x) 2025-05-07T20:31:45.8974994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.8975319Z x = x_sign * x_clamp 2025-05-07T20:31:45.8975565Z x0 = x[:, :D] 2025-05-07T20:31:45.8975800Z x1 = x[:, D:] 2025-05-07T20:31:45.8976022Z 2025-05-07T20:31:45.8976227Z if contiguous: 2025-05-07T20:31:45.8976472Z x0 = x0.contiguous() 2025-05-07T20:31:45.8976747Z x1 = x1.contiguous() 2025-05-07T20:31:45.8977002Z 2025-05-07T20:31:45.8977208Z if scale_ub is not None: 2025-05-07T20:31:45.8977498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.8977852Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.8978412Z ) 2025-05-07T20:31:45.8978620Z else: 2025-05-07T20:31:45.8978852Z scale_ub_tensor = None 2025-05-07T20:31:45.8979108Z 2025-05-07T20:31:45.8979352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.8979677Z op = silu_mul_quant 2025-05-07T20:31:45.8979929Z if compiled: 2025-05-07T20:31:45.8980184Z op = torch.compile(op) 2025-05-07T20:31:45.8980490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8980765Z 2025-05-07T20:31:45.8980967Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.8981141Z 2025-05-07T20:31:45.8981247Z moe/activation_test.py:117: 2025-05-07T20:31:45.8981551Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8981890Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.8982182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.8982893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:45.8983600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.8984152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.8984855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.8985589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.8986128Z kernel = self.compile( 2025-05-07T20:31:45.8986693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.8987454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.8987857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.8988092Z 2025-05-07T20:31:45.8988310Z self = 2025-05-07T20:31:45.8989418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.8990842Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef417e0>} 2025-05-07T20:31:45.8992219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.8993271Z context = 2025-05-07T20:31:45.8993571Z 2025-05-07T20:31:45.8993743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.8994281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.8994764Z module_map=module_map) 2025-05-07T20:31:45.8995137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.8995537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.8995821Z E ^ 2025-05-07T20:31:45.8996291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.8996755Z 2025-05-07T20:31:45.8997179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.8997709Z 2025-05-07T20:31:45.8997816Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.8998250Z self=, 2025-05-07T20:31:45.8998656Z T=16384, 2025-05-07T20:31:45.8998942Z D=5120, 2025-05-07T20:31:45.8999145Z scale_ub=1200.0, 2025-05-07T20:31:45.8999370Z contiguous=True, 2025-05-07T20:31:45.8999601Z compiled=True, 2025-05-07T20:31:45.8999815Z ) 2025-05-07T20:31:45.9000140Z self = 2025-05-07T20:31:45.9000651Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.9000943Z 2025-05-07T20:31:45.9001024Z @given( 2025-05-07T20:31:45.9001263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.9001577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.9001892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.9002235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.9002573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.9002866Z ) 2025-05-07T20:31:45.9003224Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.9003675Z def test_silu_mul_quant( 2025-05-07T20:31:45.9003923Z self, 2025-05-07T20:31:45.9004124Z T: int, 2025-05-07T20:31:45.9004324Z D: int, 2025-05-07T20:31:45.9004551Z scale_ub: Optional[float], 2025-05-07T20:31:45.9004829Z contiguous: bool, 2025-05-07T20:31:45.9005072Z compiled: bool, 2025-05-07T20:31:45.9005303Z ) -> None: 2025-05-07T20:31:45.9005526Z torch.manual_seed(2025) 2025-05-07T20:31:45.9005776Z 2025-05-07T20:31:45.9006056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.9006406Z 2025-05-07T20:31:45.9006621Z x_sign = torch.sign(x) 2025-05-07T20:31:45.9006916Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.9007231Z x = x_sign * x_clamp 2025-05-07T20:31:45.9007562Z x0 = x[:, :D] 2025-05-07T20:31:45.9007783Z x1 = x[:, D:] 2025-05-07T20:31:45.9007998Z 2025-05-07T20:31:45.9008191Z if contiguous: 2025-05-07T20:31:45.9008431Z x0 = x0.contiguous() 2025-05-07T20:31:45.9008696Z x1 = x1.contiguous() 2025-05-07T20:31:45.9008942Z 2025-05-07T20:31:45.9009139Z if scale_ub is not None: 2025-05-07T20:31:45.9009421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.9009767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.9010080Z ) 2025-05-07T20:31:45.9010284Z else: 2025-05-07T20:31:45.9010504Z scale_ub_tensor = None 2025-05-07T20:31:45.9010771Z 2025-05-07T20:31:45.9011008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.9011332Z op = silu_mul_quant 2025-05-07T20:31:45.9011593Z if compiled: 2025-05-07T20:31:45.9011843Z op = torch.compile(op) 2025-05-07T20:31:45.9012157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.9012445Z 2025-05-07T20:31:45.9012642Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.9012820Z 2025-05-07T20:31:45.9012923Z moe/activation_test.py:117: 2025-05-07T20:31:45.9013225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.9013555Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.9013847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.9014421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.9014995Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.9015719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.9016425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.9016980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.9017671Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.9018537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.9019084Z kernel = self.compile( 2025-05-07T20:31:45.9019644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.9020319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.9020718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.9020950Z 2025-05-07T20:31:45.9021166Z self = 2025-05-07T20:31:45.9022275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.9023676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef41090>} 2025-05-07T20:31:45.9025051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.9026156Z context = 2025-05-07T20:31:45.9026458Z 2025-05-07T20:31:45.9026631Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.9027166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.9027645Z module_map=module_map) 2025-05-07T20:31:45.9028136Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.9028508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.9028771Z E ^ 2025-05-07T20:31:45.9029248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.9029714Z 2025-05-07T20:31:45.9030142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.9030663Z 2025-05-07T20:31:46.0924105Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.0924809Z self=, 2025-05-07T20:31:46.0925355Z T=16384, 2025-05-07T20:31:46.0925567Z D=5120, 2025-05-07T20:31:46.0925766Z scale_ub=None, 2025-05-07T20:31:46.0925991Z contiguous=False, 2025-05-07T20:31:46.0926219Z compiled=True, 2025-05-07T20:31:46.0926435Z ) 2025-05-07T20:31:46.0926770Z self = 2025-05-07T20:31:46.0927313Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.0927609Z 2025-05-07T20:31:46.0927689Z @given( 2025-05-07T20:31:46.0927942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.0928262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.0928581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.0928923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.0929255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.0929551Z ) 2025-05-07T20:31:46.0929914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.0930369Z def test_silu_mul_quant( 2025-05-07T20:31:46.0930619Z self, 2025-05-07T20:31:46.0930820Z T: int, 2025-05-07T20:31:46.0931023Z D: int, 2025-05-07T20:31:46.0931242Z scale_ub: Optional[float], 2025-05-07T20:31:46.0931524Z contiguous: bool, 2025-05-07T20:31:46.0931775Z compiled: bool, 2025-05-07T20:31:46.0932005Z ) -> None: 2025-05-07T20:31:46.0932233Z torch.manual_seed(2025) 2025-05-07T20:31:46.0932843Z 2025-05-07T20:31:46.0933122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.0933474Z 2025-05-07T20:31:46.0933681Z x_sign = torch.sign(x) 2025-05-07T20:31:46.0933977Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.0934295Z x = x_sign * x_clamp 2025-05-07T20:31:46.0934547Z x0 = x[:, :D] 2025-05-07T20:31:46.0934767Z x1 = x[:, D:] 2025-05-07T20:31:46.0934981Z 2025-05-07T20:31:46.0935185Z if contiguous: 2025-05-07T20:31:46.0935461Z x0 = x0.contiguous() 2025-05-07T20:31:46.0935736Z x1 = x1.contiguous() 2025-05-07T20:31:46.0935992Z 2025-05-07T20:31:46.0936194Z if scale_ub is not None: 2025-05-07T20:31:46.0936472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.0936824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.0937140Z ) 2025-05-07T20:31:46.0937335Z else: 2025-05-07T20:31:46.0937567Z scale_ub_tensor = None 2025-05-07T20:31:46.0937833Z 2025-05-07T20:31:46.0938172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.0938496Z op = silu_mul_quant 2025-05-07T20:31:46.0938752Z if compiled: 2025-05-07T20:31:46.0939002Z op = torch.compile(op) 2025-05-07T20:31:46.0939310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.0939596Z 2025-05-07T20:31:46.0939793Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.0939968Z 2025-05-07T20:31:46.0940071Z moe/activation_test.py:117: 2025-05-07T20:31:46.0940373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.0940714Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.0941001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.0941779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.0942377Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.0943050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.0943758Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.0944315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.0945017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.0945753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.0946297Z kernel = self.compile( 2025-05-07T20:31:46.0946858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.0947537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.0947944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.0948188Z 2025-05-07T20:31:46.0948403Z self = 2025-05-07T20:31:46.0949515Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.0950951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef42290>} 2025-05-07T20:31:46.0952331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.0953390Z context = 2025-05-07T20:31:46.0953693Z 2025-05-07T20:31:46.0953956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.0954496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.0954976Z module_map=module_map) 2025-05-07T20:31:46.0955353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.0956008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.0956274Z E ^ 2025-05-07T20:31:46.0956754Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.0957219Z 2025-05-07T20:31:46.0957646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.0958173Z 2025-05-07T20:31:46.0958295Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.0958718Z self=, 2025-05-07T20:31:46.0959140Z T=2048, 2025-05-07T20:31:46.0959341Z D=5120, 2025-05-07T20:31:46.0959535Z scale_ub=None, 2025-05-07T20:31:46.0959761Z contiguous=False, 2025-05-07T20:31:46.0959995Z compiled=True, 2025-05-07T20:31:46.0960198Z ) 2025-05-07T20:31:46.1997651Z self = 2025-05-07T20:31:46.1998416Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:46.1998804Z 2025-05-07T20:31:46.1998924Z @given( 2025-05-07T20:31:46.1999247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.1999692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.2000012Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.2000360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.2001029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.2001330Z ) 2025-05-07T20:31:46.2001690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.2002147Z def test_silu_mul_quant( 2025-05-07T20:31:46.2002396Z self, 2025-05-07T20:31:46.2002598Z T: int, 2025-05-07T20:31:46.2002795Z D: int, 2025-05-07T20:31:46.2003019Z scale_ub: Optional[float], 2025-05-07T20:31:46.2003303Z contiguous: bool, 2025-05-07T20:31:46.2003543Z compiled: bool, 2025-05-07T20:31:46.2003781Z ) -> None: 2025-05-07T20:31:46.2004007Z torch.manual_seed(2025) 2025-05-07T20:31:46.2004248Z 2025-05-07T20:31:46.2004531Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.2004878Z 2025-05-07T20:31:46.2005072Z x_sign = torch.sign(x) 2025-05-07T20:31:46.2005370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.2005694Z x = x_sign * x_clamp 2025-05-07T20:31:46.2005940Z x0 = x[:, :D] 2025-05-07T20:31:46.2006162Z x1 = x[:, D:] 2025-05-07T20:31:46.2006385Z 2025-05-07T20:31:46.2006579Z if contiguous: 2025-05-07T20:31:46.2006851Z x0 = x0.contiguous() 2025-05-07T20:31:46.2007113Z x1 = x1.contiguous() 2025-05-07T20:31:46.2007363Z 2025-05-07T20:31:46.2007564Z if scale_ub is not None: 2025-05-07T20:31:46.2007849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.2008188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.2008505Z ) 2025-05-07T20:31:46.2008708Z else: 2025-05-07T20:31:46.2008921Z scale_ub_tensor = None 2025-05-07T20:31:46.2009181Z 2025-05-07T20:31:46.2009422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.2009744Z op = silu_mul_quant 2025-05-07T20:31:46.2010005Z if compiled: 2025-05-07T20:31:46.2010269Z op = torch.compile(op) 2025-05-07T20:31:46.2010570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2010853Z 2025-05-07T20:31:46.2011219Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.2011387Z 2025-05-07T20:31:46.2011490Z moe/activation_test.py:117: 2025-05-07T20:31:46.2011798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2012143Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.2012436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2013007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.2013585Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.2014265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.2014967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.2015577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.2016275Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.2016964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.2017505Z kernel = self.compile( 2025-05-07T20:31:46.2018172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.2018851Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.2019259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2019487Z 2025-05-07T20:31:46.2019700Z self = 2025-05-07T20:31:46.2020895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.2022318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef42170>} 2025-05-07T20:31:46.2023706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.2024751Z context = 2025-05-07T20:31:46.2025052Z 2025-05-07T20:31:46.2025224Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.2025810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.2026293Z module_map=module_map) 2025-05-07T20:31:46.2026671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.2027037Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.2027306Z E ^ 2025-05-07T20:31:46.2027783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:46.2028252Z 2025-05-07T20:31:46.2028680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:46.2029210Z 2025-05-07T20:31:46.2029318Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:46.2029750Z self=, 2025-05-07T20:31:46.2030157Z T=2048, 2025-05-07T20:31:46.2030363Z D=5120, 2025-05-07T20:31:46.2030571Z scale_ub=1200.0, 2025-05-07T20:31:46.2030802Z contiguous=False, 2025-05-07T20:31:46.2031037Z compiled=True, 2025-05-07T20:31:46.2040127Z ) 2025-05-07T20:31:46.2040519Z self = 2025-05-07T20:31:46.2041043Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:46.2041443Z 2025-05-07T20:31:46.2041524Z @given( 2025-05-07T20:31:46.2041767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:46.2042089Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:46.2042400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:46.2042742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:46.2043082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:46.2043369Z ) 2025-05-07T20:31:46.2043731Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:46.2044187Z def test_silu_mul_quant( 2025-05-07T20:31:46.2044442Z self, 2025-05-07T20:31:46.2044636Z T: int, 2025-05-07T20:31:46.2044836Z D: int, 2025-05-07T20:31:46.2045070Z scale_ub: Optional[float], 2025-05-07T20:31:46.2045370Z contiguous: bool, 2025-05-07T20:31:46.2045638Z compiled: bool, 2025-05-07T20:31:46.2045868Z ) -> None: 2025-05-07T20:31:46.2046092Z torch.manual_seed(2025) 2025-05-07T20:31:46.2046340Z 2025-05-07T20:31:46.2046620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:46.2046962Z 2025-05-07T20:31:46.2047170Z x_sign = torch.sign(x) 2025-05-07T20:31:46.2047468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:46.2047777Z x = x_sign * x_clamp 2025-05-07T20:31:46.2048024Z x0 = x[:, :D] 2025-05-07T20:31:46.2048250Z x1 = x[:, D:] 2025-05-07T20:31:46.2048457Z 2025-05-07T20:31:46.2048652Z if contiguous: 2025-05-07T20:31:46.2048882Z x0 = x0.contiguous() 2025-05-07T20:31:46.2049141Z x1 = x1.contiguous() 2025-05-07T20:31:46.2049395Z 2025-05-07T20:31:46.2049596Z if scale_ub is not None: 2025-05-07T20:31:46.2049962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:46.2050306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:46.2050625Z ) 2025-05-07T20:31:46.2050828Z else: 2025-05-07T20:31:46.2051037Z scale_ub_tensor = None 2025-05-07T20:31:46.2051301Z 2025-05-07T20:31:46.2051542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:46.2051861Z op = silu_mul_quant 2025-05-07T20:31:46.2052117Z if compiled: 2025-05-07T20:31:46.2052375Z op = torch.compile(op) 2025-05-07T20:31:46.2052673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2052956Z 2025-05-07T20:31:46.2053157Z > y_fp8, y_scale = fn() 2025-05-07T20:31:46.2053323Z 2025-05-07T20:31:46.2053427Z moe/activation_test.py:117: 2025-05-07T20:31:46.2053732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2054065Z moe/activation_test.py:115: in fn 2025-05-07T20:31:46.2054358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:46.2054924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:46.2055827Z return fn(*args, **kwargs) 
2025-05-07T20:31:46.2056621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:46.2057319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:46.2057864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:46.2058745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:46.2059547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:46.2060179Z kernel = self.compile( 2025-05-07T20:31:46.2060826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:46.2061618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:46.2062170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:46.2062404Z 2025-05-07T20:31:46.2062614Z self = 2025-05-07T20:31:46.2063710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:46.2065109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ef43880>} 2025-05-07T20:31:46.2066475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:46.2067508Z context = 2025-05-07T20:31:46.2067809Z 2025-05-07T20:31:46.2067980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:46.2068511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:46.2068990Z module_map=module_map) 2025-05-07T20:31:46.2069355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:46.2069715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:46.2069981Z E ^ 2025-05-07T20:31:46.2070451Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (... Triton compile frames identical to the traceback above ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
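For readers without the FBGEMM source at hand: the op under test fuses a SiLU-gated multiply with dynamic FP8 quantization. A plain-PyTorch sketch of the presumed semantics follows; the rowwise scaling details are assumptions for illustration, not FBGEMM's kernel, and it needs a PyTorch build with float8 dtypes.

from typing import Optional, Tuple

import torch
import torch.nn.functional as F

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated multiply, computed in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Dynamic rowwise scale, optionally capped by scale_ub.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = amax / fp8_max
    # Quantize to FP8 E4M3 (the dtype Triton calls fp8e4nv).
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale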
Hypothesis then tries further examples; each fails at moe/activation_test.py:117 with the identical CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture")
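The error message itself names the two FP8 formats this architecture does accept: fp8e4b15 and fp8e5 (E5M2). If a fallback on pre-SM-8.9 GPUs were wanted, the dtype could be keyed on device capability as sketched below; whether the FBGEMM kernel and its consumers tolerate E5M2's smaller mantissa is a separate question this log does not answer.

import torch

def pick_fp8_dtype() -> torch.dtype:
    # float8_e4m3fn maps to Triton's fp8e4nv (rejected above on SM 8.6);
    # float8_e5m2 maps to fp8e5, which the error message lists as supported.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2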
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Subsequent examples fail against the same exhausted allocator:

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError (tried to allocate 112.00 MiB; 32.44 MiB free) at moe/activation_test.py:95
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError (tried to allocate 448.00 MiB; 144.44 MiB free) at moe/activation_test.py:92
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError (tried to allocate 56.00 MiB; 32.44 MiB free) at moe/activation_test.py:95
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError (tried to allocate 56.00 MiB; 32.44 MiB free) at moe/activation_test.py:94
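The OutOfMemoryError reports all show the same picture: roughly 21.9 to 22.0 GiB of the A10G's 22.07 GiB already in use, with 40 to 136 MiB reserved but unallocated, so even 56 MiB requests fail. Below is a sketch applying the allocator's own suggestion plus a cache release between Hypothesis examples; where to hook this (for instance a conftest.py fixture or the CI job environment) is an assumption, not something this workflow does.

import os

# Must be set before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_cuda_memory() -> None:
    # Returns PyTorch's cached-but-unallocated blocks to the driver so the
    # next example starts from a less fragmented pool.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()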
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:47.3945276Z 2025-05-07T20:31:47.3945413Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:47.3945636Z 2025-05-07T20:31:47.3945777Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:47.3946232Z self=, 2025-05-07T20:31:47.3946650Z T=1, 2025-05-07T20:31:47.3946849Z D=7168, 2025-05-07T20:31:47.3947052Z scale_ub=1200.0, 2025-05-07T20:31:47.3947290Z contiguous=True, 2025-05-07T20:31:47.3947528Z compiled=False, 2025-05-07T20:31:47.3947763Z ) 2025-05-07T20:31:47.3948099Z self = 2025-05-07T20:31:47.3948606Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:47.3948878Z 2025-05-07T20:31:47.3948962Z @given( 2025-05-07T20:31:47.3949213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:47.3949543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:47.3949861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:47.3950366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:47.3950714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:47.3951010Z ) 2025-05-07T20:31:47.3951379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:47.3951842Z def test_silu_mul_quant( 2025-05-07T20:31:47.3952101Z self, 2025-05-07T20:31:47.3952307Z T: int, 2025-05-07T20:31:47.3952522Z D: int, 2025-05-07T20:31:47.3952757Z scale_ub: Optional[float], 2025-05-07T20:31:47.3953038Z contiguous: bool, 2025-05-07T20:31:47.3953294Z compiled: bool, 2025-05-07T20:31:47.3953540Z ) -> None: 2025-05-07T20:31:47.3953768Z torch.manual_seed(2025) 2025-05-07T20:31:47.3954024Z 2025-05-07T20:31:47.3954319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:47.3954668Z 2025-05-07T20:31:47.3954878Z x_sign = torch.sign(x) 2025-05-07T20:31:47.3955191Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:47.3955511Z x = x_sign * x_clamp 2025-05-07T20:31:47.3956076Z x0 = x[:, :D] 2025-05-07T20:31:47.3956310Z x1 = x[:, D:] 2025-05-07T20:31:47.3956530Z 2025-05-07T20:31:47.3956736Z if contiguous: 2025-05-07T20:31:47.3956997Z x0 = x0.contiguous() 2025-05-07T20:31:47.3957275Z x1 = x1.contiguous() 2025-05-07T20:31:47.3957539Z 2025-05-07T20:31:47.3957749Z if scale_ub is not None: 2025-05-07T20:31:47.3958041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:47.3958388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:47.3958716Z ) 2025-05-07T20:31:47.3958926Z else: 2025-05-07T20:31:47.3959304Z scale_ub_tensor = None 2025-05-07T20:31:47.3959574Z 2025-05-07T20:31:47.3959827Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:47.3960152Z op = silu_mul_quant 2025-05-07T20:31:47.3960423Z if compiled: 2025-05-07T20:31:47.3960687Z op = torch.compile(op) 2025-05-07T20:31:47.3961001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3961293Z 2025-05-07T20:31:47.3961504Z > y_fp8, y_scale = fn() 2025-05-07T20:31:47.3961676Z 2025-05-07T20:31:47.3961784Z moe/activation_test.py:117: 2025-05-07T20:31:47.3962097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3962441Z moe/activation_test.py:115: in fn 2025-05-07T20:31:47.3962744Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:47.3963457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:47.3964184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:47.3964745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:47.3965449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:47.3966186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:47.3966742Z kernel = self.compile( 2025-05-07T20:31:47.3967308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:47.3967983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:47.3968398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:47.3968630Z 2025-05-07T20:31:47.3968851Z self = 2025-05-07T20:31:47.3969966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:47.3971495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50ec83b50>} 2025-05-07T20:31:47.3972869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:47.3973919Z context = 2025-05-07T20:31:47.3974217Z 2025-05-07T20:31:47.3974402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:47.3974938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:47.3975433Z module_map=module_map) 2025-05-07T20:31:47.3975847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:47.3976251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:47.3976524Z E ^ 2025-05-07T20:31:47.3977007Z E ValueError("type fp8e4nv not supported in this architecture. 
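Note on the CompilationError: fp8e4nv is Triton's name for the float8 E4M3 format, which needs native FP8 hardware to lower; on this device Triton only accepts 'fp8e4b15' and 'fp8e5', exactly as the message lists. The 22.07 GiB capacity in the surrounding OOM messages is consistent with an NVIDIA A10G (compute capability 8.6), which predates the Ada/Hopper generation where E4M3 support arrives. A minimal sketch of a capability gate; the 8.9 cutoff is an assumption inferred from the error, not taken from the FBGEMM sources:

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8 E4M3) lowering needs an
        # Ada/Hopper-class GPU, i.e. compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)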
2025-05-07T20:31:47.3978634Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.4792625Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.4833620Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB (30.44 MiB free; 21.70 GiB allocated by PyTorch; 53.93 MiB reserved but unallocated)
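Note on the OutOfMemoryError pattern: the allocator message carries its own mitigation. With roughly 21.7 GiB already held by PyTorch and only about 30 MiB free, even a 56 MiB request fails, and the nontrivial "reserved but unallocated" figures point at fragmentation. A minimal sketch of opting into expandable segments, assuming the variable is set before the first CUDA allocation (in CI it could equally be exported in the job environment):

    import os

    # Must be set before torch initializes its CUDA caching allocator.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported afterwards so the allocator sees the setting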
2025-05-07T20:31:47.5794159Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:47.5825811Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.5838957Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 320.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)
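The requested sizes track the test's input tensor exactly: x has shape [T, 2 * D] in bfloat16, 2 bytes per element, so the torch.randn call asks for T * 2D * 2 bytes. A worked check against the failures above:

    # bfloat16 is 2 bytes per element; x is [T, 2 * D]
    T, D = 16384, 5120
    print(T * (2 * D) * 2 / 2**20)  # 320.0 -> matches "Tried to allocate 320.00 MiB"

    T, D = 2048, 7168
    print(T * (2 * D) * 2 / 2**20)  # 56.0 -> matches the 56.00 MiB failures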
2025-05-07T20:31:47.6826653Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 80.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6841180Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6856047Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6869257Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 40.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.6882143Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8182359Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8195100Z Trying example: test_silu_mul_quant(self=<…>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 112.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8207755Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)

2025-05-07T20:31:47.8220374Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB (30.44 MiB free; 21.73 GiB allocated by PyTorch; 13.87 MiB reserved but unallocated)
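Across these examples the "allocated by PyTorch" figure only ratchets upward (21.67 GiB, then 21.70, then 21.73 GiB), so memory held by earlier examples, or by whatever ran before this test, is never returned, and progressively smaller requests start failing. A sketch of a per-example cleanup hook; the function name is illustrative, not something from the test suite:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then hand cached blocks
        # back to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()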
2025-05-07T20:31:47.8232901Z Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:48.1810819Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB (30.44 MiB free; 21.74 GiB allocated by PyTorch; 5.24 MiB reserved but unallocated)
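Since every example that survives allocation dies in the same fp8e4nv CompilationError, the test cannot pass on this GPU at all, and letting Hypothesis walk the whole example grid into the same two failures mostly burns runner time. A common pattern is to skip at collection time based on device capability. A sketch, with the class name and capability threshold as illustrative assumptions:

    import unittest

    import torch

    def has_fp8e4nv_support() -> bool:
        # Assumption, as above: E4M3 needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not has_fp8e4nv_support(), "Triton fp8e4nv unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical name
        ...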
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.1823351Z 2025-05-07T20:31:48.1823477Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.1823708Z 2025-05-07T20:31:48.1823818Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.1824249Z self=, 2025-05-07T20:31:48.1824744Z T=128, 2025-05-07T20:31:48.1824940Z D=7168, 2025-05-07T20:31:48.1825136Z scale_ub=1200.0, 2025-05-07T20:31:48.1825360Z contiguous=True, 2025-05-07T20:31:48.1825594Z compiled=True, 2025-05-07T20:31:48.1825828Z ) 2025-05-07T20:31:48.2252770Z self = 2025-05-07T20:31:48.2253320Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.2253593Z 2025-05-07T20:31:48.2253681Z @given( 2025-05-07T20:31:48.2253917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2254242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2254670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2255048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2255379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2256050Z ) 2025-05-07T20:31:48.2256429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2256877Z def test_silu_mul_quant( 2025-05-07T20:31:48.2257124Z self, 2025-05-07T20:31:48.2257324Z T: int, 2025-05-07T20:31:48.2257520Z D: int, 2025-05-07T20:31:48.2257747Z scale_ub: Optional[float], 2025-05-07T20:31:48.2258113Z contiguous: bool, 2025-05-07T20:31:48.2258359Z compiled: bool, 2025-05-07T20:31:48.2258594Z ) -> None: 2025-05-07T20:31:48.2258824Z torch.manual_seed(2025) 2025-05-07T20:31:48.2259083Z 2025-05-07T20:31:48.2259382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2259774Z 2025-05-07T20:31:48.2259975Z x_sign = torch.sign(x) 2025-05-07T20:31:48.2260527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.2260850Z x = x_sign * x_clamp 2025-05-07T20:31:48.2261101Z x0 = x[:, :D] 2025-05-07T20:31:48.2261316Z x1 = x[:, D:] 2025-05-07T20:31:48.2261536Z 2025-05-07T20:31:48.2261735Z if contiguous: 2025-05-07T20:31:48.2261968Z x0 = x0.contiguous() 2025-05-07T20:31:48.2262241Z x1 = x1.contiguous() 2025-05-07T20:31:48.2262491Z 2025-05-07T20:31:48.2262685Z if scale_ub is not None: 2025-05-07T20:31:48.2262966Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.2263313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.2263625Z ) 2025-05-07T20:31:48.2263826Z else: 2025-05-07T20:31:48.2264047Z scale_ub_tensor = None 2025-05-07T20:31:48.2264299Z 2025-05-07T20:31:48.2264543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.2264866Z op = silu_mul_quant 2025-05-07T20:31:48.2265124Z if compiled: 2025-05-07T20:31:48.2265379Z op = torch.compile(op) 2025-05-07T20:31:48.2265685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.2265973Z 2025-05-07T20:31:48.2266167Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.2266342Z 2025-05-07T20:31:48.2266449Z moe/activation_test.py:117: 2025-05-07T20:31:48.2266749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.2267076Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.2267368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.2267943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.2268514Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.2269190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.2269929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.2270606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.2271306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.2272147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.2272692Z kernel = self.compile( 2025-05-07T20:31:48.2273246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.2273919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.2274315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.2274552Z 2025-05-07T20:31:48.2274764Z self = 2025-05-07T20:31:48.2275871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.2277297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50e6cf0a0>} 2025-05-07T20:31:48.2278664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.2279713Z context = 2025-05-07T20:31:48.2280010Z 2025-05-07T20:31:48.2280180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.2280715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.2281276Z module_map=module_map) 2025-05-07T20:31:48.2281654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.2282020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.2282288Z E ^ 2025-05-07T20:31:48.2282767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.2283233Z 2025-05-07T20:31:48.2283657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.2284178Z 2025-05-07T20:31:48.2284291Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2284709Z self=, 2025-05-07T20:31:48.2285118Z T=128, 2025-05-07T20:31:48.2285313Z D=7168, 2025-05-07T20:31:48.2285509Z scale_ub=1200.0, 2025-05-07T20:31:48.2285739Z contiguous=True, 2025-05-07T20:31:48.2285971Z compiled=False, 2025-05-07T20:31:48.2286183Z ) 2025-05-07T20:31:48.2286520Z self = 2025-05-07T20:31:48.2287021Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.2287300Z 2025-05-07T20:31:48.2287387Z @given( 2025-05-07T20:31:48.2287617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2287935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2288247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2288577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2288913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2289202Z ) 2025-05-07T20:31:48.2289556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2290004Z def test_silu_mul_quant( 2025-05-07T20:31:48.2290251Z self, 2025-05-07T20:31:48.2290452Z T: int, 2025-05-07T20:31:48.2290647Z D: int, 2025-05-07T20:31:48.2290875Z scale_ub: Optional[float], 2025-05-07T20:31:48.2291152Z contiguous: bool, 2025-05-07T20:31:48.2291393Z compiled: bool, 2025-05-07T20:31:48.2291747Z ) -> None: 2025-05-07T20:31:48.2291968Z torch.manual_seed(2025) 2025-05-07T20:31:48.2292209Z 2025-05-07T20:31:48.2292490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2292837Z 2025-05-07T20:31:48.2293030Z x_sign = torch.sign(x) 2025-05-07T20:31:48.2293333Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.2295406Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
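The recurring CompilationError above is architectural rather than input-dependent: Triton's fp8e4nv (FP8 E4M3) type is, to the best of current knowledge, only lowered for NVIDIA compute capability 8.9 and newer, and the GPU on this runner only exposes fp8e4b15 and fp8e5. A hedged sketch of a skip guard such a test could use; the helper and class names are illustrative, not FBGEMM's API:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: assume Triton lowers fp8e4nv on CC >= 8.9
        # (Ada/Hopper class parts); older GPUs raise the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )
    class Fp8ActivationTests(unittest.TestCase):  # illustrative name
        ...

The tuple comparison against (8, 9) works because torch.cuda.get_device_capability() returns a (major, minor) pair.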
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.2297320Z 2025-05-07T20:31:48.2297443Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.2297658Z 2025-05-07T20:31:48.2297772Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2298335Z self=, 2025-05-07T20:31:48.2298751Z T=128, 2025-05-07T20:31:48.2298948Z D=5120, 2025-05-07T20:31:48.2299142Z scale_ub=1200.0, 2025-05-07T20:31:48.2299373Z contiguous=True, 2025-05-07T20:31:48.2299623Z compiled=True, 2025-05-07T20:31:48.2299831Z ) 2025-05-07T20:31:48.2300162Z self = 2025-05-07T20:31:48.2300662Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.2300932Z 2025-05-07T20:31:48.2301019Z @given( 2025-05-07T20:31:48.2301337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.2301660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.2311477Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.2311836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.2312178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.2312472Z ) 2025-05-07T20:31:48.2312840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.2313287Z def test_silu_mul_quant( 2025-05-07T20:31:48.2313538Z self, 2025-05-07T20:31:48.2313741Z T: int, 2025-05-07T20:31:48.2313938Z D: int, 2025-05-07T20:31:48.2314167Z scale_ub: Optional[float], 2025-05-07T20:31:48.2314450Z contiguous: bool, 2025-05-07T20:31:48.2314692Z compiled: bool, 2025-05-07T20:31:48.2314931Z ) -> None: 2025-05-07T20:31:48.2315158Z torch.manual_seed(2025) 2025-05-07T20:31:48.2315401Z 2025-05-07T20:31:48.2315702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.2316101Z 2025-05-07T20:31:48.2316297Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.2318307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
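For orientation, the test checks a rowwise-scaled FP8 scheme: the listings in this log dequantize with y = y_fp8.to(torch.float32) * y_scale[:, None]. A conceptual pure-PyTorch sketch of a quantizer consistent with that convention follows; it is an illustration only, not FBGEMM's triton_quantize_fp8_row, and the scale_ub semantics are an assumption:

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def rowwise_quantize_fp8_sketch(y, scale_ub=None):
        # One scale per row, chosen so y_fp8 * scale[:, None] reconstructs y.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max (hypothetical).
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / FP8_E4M3_MAX).to(torch.float32)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

On this machine FBGEMM's Triton kernels fail at compile time (make_ir) before any such math runs, which is the ValueError the log keeps hitting.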
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.2320207Z 2025-05-07T20:31:48.2320330Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.2320557Z 2025-05-07T20:31:48.2320664Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.2321097Z self=, 2025-05-07T20:31:48.2321506Z T=128, 2025-05-07T20:31:48.2321702Z D=7168, 2025-05-07T20:31:48.2322046Z scale_ub=None, 2025-05-07T20:31:48.2322266Z contiguous=True, 2025-05-07T20:31:48.2322504Z compiled=True, 2025-05-07T20:31:48.2322719Z ) 2025-05-07T20:31:48.5244324Z self = 2025-05-07T20:31:48.5244942Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5245218Z 2025-05-07T20:31:48.5245305Z @given( 2025-05-07T20:31:48.5245564Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5245901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5246232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5246803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5247494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5248123Z ) 2025-05-07T20:31:48.5248847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5249759Z def test_silu_mul_quant( 2025-05-07T20:31:48.5250279Z self, 2025-05-07T20:31:48.5250685Z T: int, 2025-05-07T20:31:48.5251094Z D: int, 2025-05-07T20:31:48.5251551Z scale_ub: Optional[float], 2025-05-07T20:31:48.5252110Z contiguous: bool, 2025-05-07T20:31:48.5252612Z compiled: bool, 2025-05-07T20:31:48.5253091Z ) -> None: 2025-05-07T20:31:48.5253539Z torch.manual_seed(2025) 2025-05-07T20:31:48.5254048Z 2025-05-07T20:31:48.5254618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5257655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
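Note the trend across examples: the first OOM still reported 30.44 MiB free, while these later ones are down to 8.44 MiB, with roughly 21.77 GiB pinned as "allocated by PyTorch". That pattern suggests tensors from earlier Hypothesis examples are still live when the next example starts. A sketch of reclaiming memory per example, assuming the test body can afford it:

    import gc

    import torch

    # Sketch: Hypothesis runs many examples inside a single test-method call,
    # so unittest's tearDown does not run between examples; reclaim memory at
    # the top of the test body instead.
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        gc.collect()            # drop dangling references from prior examples
        torch.cuda.empty_cache()  # return cached blocks to the driver
        torch.manual_seed(2025)
        ...  # rest of the body unchanged

This frees cached blocks, not leaked references; if something in the kernel path retains the inputs, the numbers above would keep climbing regardless.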
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5259673Z 2025-05-07T20:31:48.5259809Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5260031Z 2025-05-07T20:31:48.5312118Z FAILED 2025-05-07T20:31:48.5312366Z 2025-05-07T20:31:48.5312666Z =================================== FAILURES =================================== 2025-05-07T20:31:48.5313130Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:48.5313577Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:48.5314243Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:48.5314909Z | yield 2025-05-07T20:31:48.5315498Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:31:48.5316208Z | self._callTestMethod(testMethod) 2025-05-07T20:31:48.5316992Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:31:48.5317771Z | method() 2025-05-07T20:31:48.5318694Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:48.5319727Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5320644Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:48.5321580Z | raise the_error_hypothesis_found 2025-05-07T20:31:48.5322277Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:48.5322962Z +-+---------------- 1 ---------------- 2025-05-07T20:31:48.5323381Z | Traceback (most recent call last): 2025-05-07T20:31:48.5324387Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5325702Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5328415Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
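Hypothesis folds the four failure classes into one exceptiongroup.ExceptionGroup (the PEP 654 backport, since this environment is Python 3.10). A small sketch of triaging such a group programmatically, assuming the backport's split() API; the function name is illustrative:

    import torch
    from exceptiongroup import ExceptionGroup
    from triton.compiler.errors import CompilationError

    def split_failures(eg: ExceptionGroup):
        # split() returns (matching, non-matching); either side may be None.
        ooms, rest = eg.split(torch.OutOfMemoryError)
        compile_errors, other = rest.split(CompilationError) if rest else (None, None)
        return ooms, compile_errors, other

Applied to this run it would separate the three CUDA OOMs (sub-exceptions 1 through 3) from the fp8e4nv CompilationError (sub-exception 4).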
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5330511Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5331086Z | self=, 2025-05-07T20:31:48.5331520Z | T=128, 2025-05-07T20:31:48.5331737Z | D=7168, 2025-05-07T20:31:48.5331963Z | scale_ub=1200.0, 2025-05-07T20:31:48.5332219Z | contiguous=True, 2025-05-07T20:31:48.5332518Z | compiled=False, 2025-05-07T20:31:48.5332815Z | ) 2025-05-07T20:31:48.5332993Z | 2025-05-07T20:31:48.5333544Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:48.5334173Z +---------------- 2 ---------------- 2025-05-07T20:31:48.5334503Z | Traceback (most recent call last): 2025-05-07T20:31:48.5335509Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5336755Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5339051Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5341089Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5341533Z | self=, 2025-05-07T20:31:48.5341945Z | T=128, 2025-05-07T20:31:48.5342155Z | D=7168, 2025-05-07T20:31:48.5342364Z | scale_ub=None, 2025-05-07T20:31:48.5342616Z | contiguous=True, 2025-05-07T20:31:48.5342871Z | compiled=True, 2025-05-07T20:31:48.5343092Z | ) 2025-05-07T20:31:48.5343283Z | 2025-05-07T20:31:48.5343816Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5344439Z +---------------- 3 ---------------- 2025-05-07T20:31:48.5344732Z | Traceback (most recent call last): 2025-05-07T20:31:48.5345463Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:48.5346282Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5348411Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
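The note above gives the exact replay recipe. A sketch of the decorator applied to this test, mirroring the @given stack from the log; the version string must match the installed Hypothesis 6.131.14, and max_examples is omitted here because _MAX_SAMPLES is defined elsewhere in the test module:

    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # blob copied from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body unchanged

Hypothesis raises an error if the decorator is left in after the underlying failure is fixed, so it is meant to be temporary, exactly as the log's wording says.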
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5350555Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5351005Z | self=, 2025-05-07T20:31:48.5351420Z | T=128, 2025-05-07T20:31:48.5351627Z | D=5120, 2025-05-07T20:31:48.5351836Z | scale_ub=1200.0, 2025-05-07T20:31:48.5352084Z | contiguous=True, 2025-05-07T20:31:48.5352328Z | compiled=True, 2025-05-07T20:31:48.5352550Z | ) 2025-05-07T20:31:48.5352739Z | 2025-05-07T20:31:48.5353273Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5353939Z +---------------- 4 ---------------- 2025-05-07T20:31:48.5354251Z | Traceback (most recent call last): 2025-05-07T20:31:48.5355198Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:48.5356422Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5357367Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:48.5358368Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5359588Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:48.5360734Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5361743Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:48.5362797Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5363852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:48.5364953Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5366092Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:48.5367290Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5368396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:48.5369385Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5370299Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:48.5371106Z | fn() 2025-05-07T20:31:48.5371909Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:48.5372794Z | self.fn.run( 2025-05-07T20:31:48.5373531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:48.5374364Z | kernel = self.compile( 2025-05-07T20:31:48.5375238Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:48.5376225Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5377108Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:48.5377914Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5378705Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5379062Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5379330Z | ^ 2025-05-07T20:31:48.5379806Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5380386Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:48.5380795Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:48.5381319Z | self=, 2025-05-07T20:31:48.5381762Z | T=1, # or any other generated value 2025-05-07T20:31:48.5382078Z | D=5120, # or any other generated value 2025-05-07T20:31:48.5382427Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:48.5382802Z | contiguous=True, # or any other generated value 2025-05-07T20:31:48.5383182Z | compiled=True, # or any other generated value 2025-05-07T20:31:48.5383485Z | ) 2025-05-07T20:31:48.5383673Z | 2025-05-07T20:31:48.5384210Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:48.5384820Z +------------------------------------ 2025-05-07T20:31:48.5385190Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:48.5385571Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5385996Z self=, 2025-05-07T20:31:48.5386562Z T=1, 2025-05-07T20:31:48.5386837Z D=5120, 2025-05-07T20:31:48.5387124Z scale_ub=None, 2025-05-07T20:31:48.5387572Z contiguous=True, 2025-05-07T20:31:48.5387894Z compiled=True, 2025-05-07T20:31:48.5388195Z ) 2025-05-07T20:31:48.5388651Z self = 2025-05-07T20:31:48.5389369Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5389747Z 2025-05-07T20:31:48.5389868Z @given( 2025-05-07T20:31:48.5390204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5390667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5391120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5395657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5396150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5396629Z ) 2025-05-07T20:31:48.5397140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5397773Z def test_silu_mul_quant( 2025-05-07T20:31:48.5398141Z self, 2025-05-07T20:31:48.5398438Z T: int, 2025-05-07T20:31:48.5398717Z D: int, 2025-05-07T20:31:48.5399045Z scale_ub: Optional[float], 2025-05-07T20:31:48.5399465Z contiguous: bool, 2025-05-07T20:31:48.5399808Z compiled: bool, 2025-05-07T20:31:48.5400140Z ) -> None: 2025-05-07T20:31:48.5400459Z torch.manual_seed(2025) 2025-05-07T20:31:48.5400810Z 2025-05-07T20:31:48.5401208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5401710Z 2025-05-07T20:31:48.5401990Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5402418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5402871Z x = x_sign * x_clamp 2025-05-07T20:31:48.5403231Z x0 = x[:, :D] 2025-05-07T20:31:48.5403543Z x1 = x[:, D:] 2025-05-07T20:31:48.5403850Z 2025-05-07T20:31:48.5404127Z if contiguous: 2025-05-07T20:31:48.5404459Z x0 = x0.contiguous() 
2025-05-07T20:31:48.5404845Z x1 = x1.contiguous() 2025-05-07T20:31:48.5405202Z 2025-05-07T20:31:48.5405479Z if scale_ub is not None: 2025-05-07T20:31:48.5405883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5406530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5407012Z ) 2025-05-07T20:31:48.5407295Z else: 2025-05-07T20:31:48.5407600Z scale_ub_tensor = None 2025-05-07T20:31:48.5407971Z 2025-05-07T20:31:48.5408312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5408767Z op = silu_mul_quant 2025-05-07T20:31:48.5409122Z if compiled: 2025-05-07T20:31:48.5409482Z op = torch.compile(op) 2025-05-07T20:31:48.5409920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5410328Z 2025-05-07T20:31:48.5410610Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5411028Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5411444Z 2025-05-07T20:31:48.5411786Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5412271Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5412706Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5413157Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5413680Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5414128Z 2025-05-07T20:31:48.5414415Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5414710Z 2025-05-07T20:31:48.5414856Z moe/activation_test.py:126: 2025-05-07T20:31:48.5415292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5415782Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5416255Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5417494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5418691Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5419467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5420454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5421456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5422515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5423593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5424685Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5425735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5426702Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5427575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5428342Z fn() 2025-05-07T20:31:48.5429090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5429932Z self.fn.run( 2025-05-07T20:31:48.5430613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5431363Z kernel = self.compile( 2025-05-07T20:31:48.5432111Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5433006Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5433545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5433861Z 2025-05-07T20:31:48.5434162Z self = 2025-05-07T20:31:48.5435688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5437803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d8f4af0>} 2025-05-07T20:31:48.5439615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5441031Z context = 2025-05-07T20:31:48.5441445Z 2025-05-07T20:31:48.5441700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5442405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5443006Z module_map=module_map) 2025-05-07T20:31:48.5443473Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5443928Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5444264Z E ^ 2025-05-07T20:31:48.5444873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5445491Z 2025-05-07T20:31:48.5446032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5446697Z 2025-05-07T20:31:48.5446839Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5447357Z self=, 2025-05-07T20:31:48.5447894Z T=2048, 2025-05-07T20:31:48.5448253Z D=5120, 2025-05-07T20:31:48.5448497Z scale_ub=1200.0, 2025-05-07T20:31:48.5448807Z contiguous=True, 2025-05-07T20:31:48.5449099Z compiled=False, 2025-05-07T20:31:48.5449360Z ) 2025-05-07T20:31:48.5449765Z self = 2025-05-07T20:31:48.5450407Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.5450776Z 2025-05-07T20:31:48.5450892Z @given( 2025-05-07T20:31:48.5451188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5451583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5451972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5452384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5452804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5453191Z ) 2025-05-07T20:31:48.5453690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5454275Z def test_silu_mul_quant( 2025-05-07T20:31:48.5454617Z self, 2025-05-07T20:31:48.5454873Z T: int, 2025-05-07T20:31:48.5455132Z D: int, 2025-05-07T20:31:48.5455409Z scale_ub: Optional[float], 2025-05-07T20:31:48.5456078Z contiguous: bool, 2025-05-07T20:31:48.5456401Z compiled: bool, 2025-05-07T20:31:48.5456743Z ) -> None: 2025-05-07T20:31:48.5457015Z torch.manual_seed(2025) 2025-05-07T20:31:48.5457336Z 2025-05-07T20:31:48.5457692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5458261Z 2025-05-07T20:31:48.5458537Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5458953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5459393Z x = x_sign * x_clamp 2025-05-07T20:31:48.5459730Z x0 = x[:, :D] 
2025-05-07T20:31:48.5460031Z x1 = x[:, D:] 2025-05-07T20:31:48.5460319Z 2025-05-07T20:31:48.5460594Z if contiguous: 2025-05-07T20:31:48.5460925Z x0 = x0.contiguous() 2025-05-07T20:31:48.5461288Z x1 = x1.contiguous() 2025-05-07T20:31:48.5461843Z 2025-05-07T20:31:48.5462115Z if scale_ub is not None: 2025-05-07T20:31:48.5462489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5462950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5463400Z ) 2025-05-07T20:31:48.5463693Z else: 2025-05-07T20:31:48.5484497Z scale_ub_tensor = None 2025-05-07T20:31:48.5484890Z 2025-05-07T20:31:48.5485246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5485700Z op = silu_mul_quant 2025-05-07T20:31:48.5486062Z if compiled: 2025-05-07T20:31:48.5486452Z op = torch.compile(op) 2025-05-07T20:31:48.5486901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5487299Z 2025-05-07T20:31:48.5487599Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.5487837Z 2025-05-07T20:31:48.5487988Z moe/activation_test.py:117: 2025-05-07T20:31:48.5488405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5488888Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.5489299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5490278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.5491266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.5492039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5492996Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5493937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5494701Z kernel = self.compile( 2025-05-07T20:31:48.5495715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5496666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5497220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5497555Z 2025-05-07T20:31:48.5497847Z self = 2025-05-07T20:31:48.5499487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5501500Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d946ef0>} 2025-05-07T20:31:48.5503418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5504882Z context = 2025-05-07T20:31:48.5505299Z 2025-05-07T20:31:48.5505540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5506343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5507013Z module_map=module_map) 2025-05-07T20:31:48.5507534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5508028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.5508394Z E ^ 2025-05-07T20:31:48.5509064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5509718Z 2025-05-07T20:31:48.5510316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5511038Z 2025-05-07T20:31:48.5511320Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5511905Z self=, 2025-05-07T20:31:48.5512469Z T=2048, 2025-05-07T20:31:48.5512741Z D=5120, 2025-05-07T20:31:48.5513014Z scale_ub=1200.0, 2025-05-07T20:31:48.5513338Z contiguous=True, 2025-05-07T20:31:48.5513648Z compiled=True, 2025-05-07T20:31:48.5513927Z ) 2025-05-07T20:31:48.5514383Z self = 2025-05-07T20:31:48.5515094Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.5515466Z 2025-05-07T20:31:48.5515580Z @given( 2025-05-07T20:31:48.5515887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5516356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5516819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5517290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5517777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5518202Z ) 2025-05-07T20:31:48.5518711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5519350Z def test_silu_mul_quant( 2025-05-07T20:31:48.5519698Z self, 2025-05-07T20:31:48.5519975Z T: int, 2025-05-07T20:31:48.5520258Z D: int, 2025-05-07T20:31:48.5520585Z scale_ub: Optional[float], 2025-05-07T20:31:48.5520974Z contiguous: bool, 2025-05-07T20:31:48.5521327Z compiled: bool, 2025-05-07T20:31:48.5521653Z ) -> None: 2025-05-07T20:31:48.5521962Z torch.manual_seed(2025) 2025-05-07T20:31:48.5522317Z 2025-05-07T20:31:48.5522716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5523202Z 2025-05-07T20:31:48.5523594Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5524019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5524461Z x = x_sign * x_clamp 2025-05-07T20:31:48.5524813Z x0 = x[:, :D] 2025-05-07T20:31:48.5525131Z x1 = x[:, D:] 2025-05-07T20:31:48.5525422Z 2025-05-07T20:31:48.5525680Z if contiguous: 2025-05-07T20:31:48.5526020Z x0 = x0.contiguous() 2025-05-07T20:31:48.5526380Z x1 = x1.contiguous() 2025-05-07T20:31:48.5526786Z 2025-05-07T20:31:48.5527071Z if scale_ub is not None: 2025-05-07T20:31:48.5527465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5527965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5528411Z ) 2025-05-07T20:31:48.5528686Z else: 2025-05-07T20:31:48.5528977Z scale_ub_tensor = None 2025-05-07T20:31:48.5529339Z 2025-05-07T20:31:48.5529674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5530131Z op = silu_mul_quant 2025-05-07T20:31:48.5530482Z if compiled: 2025-05-07T20:31:48.5530814Z op = torch.compile(op) 2025-05-07T20:31:48.5531232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5531618Z 2025-05-07T20:31:48.5531881Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5532262Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5532660Z 2025-05-07T20:31:48.5532988Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5533462Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5533885Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5534329Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5534828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5535265Z 2025-05-07T20:31:48.5535533Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5535813Z 2025-05-07T20:31:48.5535989Z moe/activation_test.py:126: 2025-05-07T20:31:48.5536386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5536954Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5537384Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5538537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5539584Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5540354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5541319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5542285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5543307Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5544377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5545473Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5546572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5547489Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5548154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5548686Z fn() 2025-05-07T20:31:48.5549206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5549802Z self.fn.run( 2025-05-07T20:31:48.5550399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5550942Z kernel = self.compile( 2025-05-07T20:31:48.5551496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5552171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5552576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5552805Z 2025-05-07T20:31:48.5553018Z self = 2025-05-07T20:31:48.5554122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5556052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53d9b4790>} 2025-05-07T20:31:48.5557442Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5558483Z context = 2025-05-07T20:31:48.5558789Z 2025-05-07T20:31:48.5558962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5559501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5559984Z module_map=module_map) 2025-05-07T20:31:48.5560359Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5560725Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5561000Z E ^ 2025-05-07T20:31:48.5561473Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5561935Z 2025-05-07T20:31:48.5562355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5563752Z 2025-05-07T20:31:48.5563859Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5564282Z self=, 2025-05-07T20:31:48.5564684Z T=16384, 2025-05-07T20:31:48.5564880Z D=7168, 2025-05-07T20:31:48.5565078Z scale_ub=1200.0, 2025-05-07T20:31:48.5565305Z contiguous=False, 2025-05-07T20:31:48.5565543Z compiled=False, 2025-05-07T20:31:48.5565755Z ) 2025-05-07T20:31:48.5566127Z self = 2025-05-07T20:31:48.5566636Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.5566924Z 2025-05-07T20:31:48.5567003Z @given( 2025-05-07T20:31:48.5567247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5567561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5567873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5568214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5568545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5568841Z ) 2025-05-07T20:31:48.5569203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5569647Z def test_silu_mul_quant( 2025-05-07T20:31:48.5569894Z self, 2025-05-07T20:31:48.5570092Z T: int, 2025-05-07T20:31:48.5570296Z D: int, 2025-05-07T20:31:48.5570512Z scale_ub: Optional[float], 2025-05-07T20:31:48.5570788Z contiguous: bool, 2025-05-07T20:31:48.5571032Z compiled: bool, 2025-05-07T20:31:48.5571256Z ) -> None: 2025-05-07T20:31:48.5571477Z torch.manual_seed(2025) 2025-05-07T20:31:48.5571727Z 2025-05-07T20:31:48.5572131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5572489Z 2025-05-07T20:31:48.5572695Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5572996Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5573315Z x = x_sign * x_clamp 2025-05-07T20:31:48.5573561Z x0 = x[:, :D] 2025-05-07T20:31:48.5573778Z x1 = x[:, D:] 2025-05-07T20:31:48.5573990Z 2025-05-07T20:31:48.5574184Z if contiguous: 2025-05-07T20:31:48.5574415Z x0 = x0.contiguous() 2025-05-07T20:31:48.5574683Z x1 = x1.contiguous() 2025-05-07T20:31:48.5574932Z 2025-05-07T20:31:48.5575127Z if scale_ub is not None: 2025-05-07T20:31:48.5575414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5575757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5576068Z ) 2025-05-07T20:31:48.5576263Z else: 2025-05-07T20:31:48.5576522Z scale_ub_tensor = None 2025-05-07T20:31:48.5576788Z 2025-05-07T20:31:48.5577025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5577345Z op = silu_mul_quant 2025-05-07T20:31:48.5577608Z if compiled: 
2025-05-07T20:31:48.5577858Z op = torch.compile(op) 2025-05-07T20:31:48.5578345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5578626Z 2025-05-07T20:31:48.5578823Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.5578994Z 2025-05-07T20:31:48.5579112Z moe/activation_test.py:117: 2025-05-07T20:31:48.5579415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5579749Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.5580035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5580738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.5581455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.5582005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5582824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5583504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5584050Z kernel = self.compile( 2025-05-07T20:31:48.5584597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5585269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5585674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5585903Z 2025-05-07T20:31:48.5586127Z self = 2025-05-07T20:31:48.5587279Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5588691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc53d28d2d0>} 2025-05-07T20:31:48.5590063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5591104Z context = 2025-05-07T20:31:48.5591395Z 2025-05-07T20:31:48.5591569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5592091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5592649Z module_map=module_map) 2025-05-07T20:31:48.5593027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5593385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.5593649Z E ^ 2025-05-07T20:31:48.5594122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5594580Z 2025-05-07T20:31:48.5595010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5595528Z 2025-05-07T20:31:48.5595634Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5596059Z self=, 2025-05-07T20:31:48.5596465Z T=1, 2025-05-07T20:31:48.5596647Z D=7168, 2025-05-07T20:31:48.5596843Z scale_ub=None, 2025-05-07T20:31:48.5597062Z contiguous=True, 2025-05-07T20:31:48.5597283Z compiled=True, 2025-05-07T20:31:48.5597494Z ) 2025-05-07T20:31:48.5597826Z self = 2025-05-07T20:31:48.5598315Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5598581Z 2025-05-07T20:31:48.5598657Z @given( 2025-05-07T20:31:48.5598889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5599208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5599510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5599842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5600176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5600460Z ) 2025-05-07T20:31:48.5600821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5601268Z def test_silu_mul_quant( 2025-05-07T20:31:48.5601512Z self, 2025-05-07T20:31:48.5601702Z T: int, 2025-05-07T20:31:48.5601901Z D: int, 2025-05-07T20:31:48.5602130Z scale_ub: Optional[float], 2025-05-07T20:31:48.5602401Z contiguous: bool, 2025-05-07T20:31:48.5602645Z compiled: bool, 2025-05-07T20:31:48.5602961Z ) -> None: 2025-05-07T20:31:48.5603174Z torch.manual_seed(2025) 2025-05-07T20:31:48.5603421Z 2025-05-07T20:31:48.5603700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5604038Z 2025-05-07T20:31:48.5604238Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5604537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5604842Z x = x_sign * x_clamp 2025-05-07T20:31:48.5605086Z x0 = x[:, :D] 2025-05-07T20:31:48.5605307Z x1 = x[:, D:] 2025-05-07T20:31:48.5605511Z 2025-05-07T20:31:48.5605703Z if contiguous: 2025-05-07T20:31:48.5605938Z x0 = x0.contiguous() 2025-05-07T20:31:48.5606206Z x1 = x1.contiguous() 2025-05-07T20:31:48.5606487Z 2025-05-07T20:31:48.5606690Z if scale_ub is not None: 2025-05-07T20:31:48.5606965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5607299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5607611Z ) 2025-05-07T20:31:48.5607806Z else: 2025-05-07T20:31:48.5608011Z scale_ub_tensor = None 2025-05-07T20:31:48.5608265Z 2025-05-07T20:31:48.5608500Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5608813Z op = silu_mul_quant 2025-05-07T20:31:48.5609064Z if compiled: 2025-05-07T20:31:48.5609315Z op = torch.compile(op) 2025-05-07T20:31:48.5609610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5609886Z 2025-05-07T20:31:48.5610084Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5610369Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5610661Z 2025-05-07T20:31:48.5610903Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5611331Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5611624Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5611945Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5612308Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5612613Z 2025-05-07T20:31:48.5612820Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:48.5613017Z 2025-05-07T20:31:48.5613123Z moe/activation_test.py:126: 2025-05-07T20:31:48.5613415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5613747Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5614083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5614884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5615644Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5616204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5616952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5617644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5618484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5619246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5620004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5620739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5621397Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5622009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5622620Z fn() 2025-05-07T20:31:48.5623130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5623721Z self.fn.run( 2025-05-07T20:31:48.5624197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5624730Z kernel = self.compile( 2025-05-07T20:31:48.5625280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5625944Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5626342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5626572Z 2025-05-07T20:31:48.5626787Z self = 2025-05-07T20:31:48.5627882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5629287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fc53c0fd5a0>} 2025-05-07T20:31:48.5630648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5631691Z context = 2025-05-07T20:31:48.5631982Z 2025-05-07T20:31:48.5632153Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5632838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5633321Z module_map=module_map) 2025-05-07T20:31:48.5633692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5634053Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5634324Z E ^ 2025-05-07T20:31:48.5634796Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5635251Z 2025-05-07T20:31:48.5635673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5636196Z 2025-05-07T20:31:48.5636310Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5636769Z self=, 2025-05-07T20:31:48.5637166Z T=4096, 2025-05-07T20:31:48.5637360Z D=5120, 2025-05-07T20:31:48.5637555Z scale_ub=None, 2025-05-07T20:31:48.5637781Z contiguous=False, 2025-05-07T20:31:48.5638005Z compiled=False, 2025-05-07T20:31:48.5638213Z ) 2025-05-07T20:31:48.5638539Z self = 2025-05-07T20:31:48.5639040Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.5639320Z 2025-05-07T20:31:48.5639397Z @given( 2025-05-07T20:31:48.5639631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5639941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5640254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5640590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5640917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5641204Z ) 2025-05-07T20:31:48.5641559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5642009Z def test_silu_mul_quant( 2025-05-07T20:31:48.5642253Z self, 2025-05-07T20:31:48.5642453Z T: int, 2025-05-07T20:31:48.5642653Z D: int, 2025-05-07T20:31:48.5642871Z scale_ub: Optional[float], 2025-05-07T20:31:48.5643238Z contiguous: bool, 2025-05-07T20:31:48.5643486Z compiled: bool, 2025-05-07T20:31:48.5643707Z ) -> None: 2025-05-07T20:31:48.5643926Z torch.manual_seed(2025) 2025-05-07T20:31:48.5644173Z 2025-05-07T20:31:48.5644445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5644791Z 2025-05-07T20:31:48.5644993Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5645287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5645599Z x = x_sign * x_clamp 2025-05-07T20:31:48.5645841Z x0 = x[:, :D] 2025-05-07T20:31:48.5646055Z x1 = x[:, D:] 2025-05-07T20:31:48.5646267Z 2025-05-07T20:31:48.5646464Z if contiguous: 2025-05-07T20:31:48.5646732Z x0 = x0.contiguous() 2025-05-07T20:31:48.5646995Z x1 = x1.contiguous() 2025-05-07T20:31:48.5647236Z 2025-05-07T20:31:48.5647431Z if scale_ub is not None: 2025-05-07T20:31:48.5647707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5648045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5648356Z ) 2025-05-07T20:31:48.5648543Z else: 2025-05-07T20:31:48.5648754Z scale_ub_tensor = None 2025-05-07T20:31:48.5649009Z 2025-05-07T20:31:48.5649241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5649561Z op = silu_mul_quant 2025-05-07T20:31:48.5649816Z if compiled: 
2025-05-07T20:31:48.5650061Z                 op = torch.compile(op)
2025-05-07T20:31:48.5650362Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:48.5650638Z 
2025-05-07T20:31:48.5650828Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:48.5650998Z 
2025-05-07T20:31:48.5651097Z moe/activation_test.py:117: 
2025-05-07T20:31:48.5651503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:48.5651836Z moe/activation_test.py:115: in fn
2025-05-07T20:31:48.5652121Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:48.5652822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:48.5653522Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:48.5654063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:48.5654755Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:48.5655427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:48.5656292Z     kernel = self.compile(
2025-05-07T20:31:48.5665393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:48.5666114Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:48.5666567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:48.5672639Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:48.5673177Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:48.5673656Z                            module_map=module_map)
2025-05-07T20:31:48.5674025Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:48.5674388Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:48.5674659Z E   ^
2025-05-07T20:31:48.5675133Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:48.5675598Z 
2025-05-07T20:31:48.5676029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
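The root cause is the same in every example: this job runs on a g5.4xlarge runner, whose NVIDIA A10G reports compute capability (8, 6), while Triton's fp8e4nv type (PyTorch's float8_e4m3fn) requires native fp8 hardware support, which arrives with compute capability 8.9 (Ada) and 9.0 (Hopper); on sm_86 Triton only offers ('fp8e4b15', 'fp8e5'), exactly as the ValueError reports. A guard along the following lines, a minimal sketch assuming PyTorch's torch.cuda.get_device_capability() API (the helper and decorator names are illustrative, not FBGEMM's actual gating), would let such tests skip cleanly instead of erroring on unsupported hardware:

```python
import unittest

import torch


def _cuda_supports_fp8e4nv() -> bool:
    """Illustrative check: fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9."""
    if not torch.cuda.is_available():
        return False
    # An A10G (g5.4xlarge) reports (8, 6) and fails this check.
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class or method, this turns the hard CompilationError
# into a clean skip on pre-Ada GPUs.
requires_fp8e4nv = unittest.skipIf(
    not _cuda_supports_fp8e4nv(),
    "fp8e4nv requires an fp8-capable GPU (compute capability >= 8.9)",
)
```

A capability check of this shape mirrors the condition Triton itself enforces at compile time, so it fails in the same cases without ever invoking the compiler.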
Hypothesis then works through the remaining parameter combinations, and every one hits the identical fp8e4nv CompilationError. With compiled=False the error is raised directly from fn() inside the kernel under test, _fbgemm_silu_mul_quant; with compiled=True, fn() completes and the same error is instead raised from ref_fn(), whose triton_quantize_fp8_row reference launches _kernel_quantize_fp8_row through the Triton autotuner (fp8_gemm.py:2370 -> autotuner.py:186 -> compiler.py:273). The per-example test listings and tracebacks are identical except for the parameters, and condense to:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> CompilationError in _kernel_quantize_fp8_row (ref_fn(), moe/activation_test.py:126)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (fn(), moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> CompilationError in _kernel_quantize_fp8_row (ref_fn(), moe/activation_test.py:126)

The remaining examples, all with compiled=True, fail on this same ref_fn() path; what that reference computes is sketched below.
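For context on the failing reference path: ref_fn() computes the SiLU-gated product in fp32 (y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, moe/activation_test.py:124) and then quantizes it row-wise to fp8 via triton_quantize_fp8_row, under the contract that y is recovered as y_fp8.to(torch.float32) * y_scale[:, None], which is exactly how the test dequantizes. A rough eager-mode sketch of that contract, assuming torch.float8_e4m3fn as the target type and a simple max-abs row scale with scale_ub capping the row max (the epsilon and clamping details are illustrative and may differ from FBGEMM's actual kernel):

```python
from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # assumed target type; finfo(...).max == 448.0
FP8_MAX = torch.finfo(FP8_DTYPE).max


def silu_mul_quant_eager(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn() in the test.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row, from the row's max magnitude (optionally capped).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    # Contract checked by the test: y ~= y_fp8.to(torch.float32) * y_scale[:, None]
    y_fp8 = (y / y_scale[:, None]).to(FP8_DTYPE)
    return y_fp8, y_scale
```

Note that even this eager sketch would fail on the runner at the final cast on some stacks, since the failure here is about hardware fp8 support, not about the Triton kernel specifically.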
at 0x7fc517344820>} 2025-05-07T20:31:48.5829715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5829920Z context = 2025-05-07T20:31:48.5829924Z 2025-05-07T20:31:48.5830089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5830360Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5830475Z module_map=module_map) 2025-05-07T20:31:48.5830641Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5830748Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5830829Z E ^ 2025-05-07T20:31:48.5831196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5831201Z 2025-05-07T20:31:48.5831629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5831720Z 2025-05-07T20:31:48.5831825Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5832056Z self=, 2025-05-07T20:31:48.5832137Z T=2048, 2025-05-07T20:31:48.5832213Z D=5120, 2025-05-07T20:31:48.5832295Z scale_ub=None, 2025-05-07T20:31:48.5832387Z contiguous=True, 2025-05-07T20:31:48.5832468Z compiled=True, 2025-05-07T20:31:48.5832544Z ) 2025-05-07T20:31:48.5832768Z self = 2025-05-07T20:31:48.5832943Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5832947Z 2025-05-07T20:31:48.5833026Z @given( 2025-05-07T20:31:48.5833152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5833252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5833373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5833499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5833617Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5833693Z ) 2025-05-07T20:31:48.5833944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5834044Z def test_silu_mul_quant( 2025-05-07T20:31:48.5834119Z self, 2025-05-07T20:31:48.5834193Z T: int, 2025-05-07T20:31:48.5834270Z D: int, 2025-05-07T20:31:48.5834371Z scale_ub: Optional[float], 2025-05-07T20:31:48.5834462Z contiguous: bool, 2025-05-07T20:31:48.5834552Z compiled: bool, 2025-05-07T20:31:48.5834629Z ) -> None: 2025-05-07T20:31:48.5834726Z torch.manual_seed(2025) 2025-05-07T20:31:48.5834803Z 2025-05-07T20:31:48.5835057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5835131Z 2025-05-07T20:31:48.5835228Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5835359Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5835454Z x = x_sign * x_clamp 2025-05-07T20:31:48.5835534Z x0 = x[:, :D] 2025-05-07T20:31:48.5835615Z x1 = x[:, D:] 2025-05-07T20:31:48.5835693Z 2025-05-07T20:31:48.5835778Z if contiguous: 2025-05-07T20:31:48.5835871Z x0 = x0.contiguous() 2025-05-07T20:31:48.5835964Z x1 = x1.contiguous() 2025-05-07T20:31:48.5836037Z 2025-05-07T20:31:48.5836128Z if scale_ub is not None: 2025-05-07T20:31:48.5836244Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5836379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5836454Z ) 2025-05-07T20:31:48.5836531Z else: 2025-05-07T20:31:48.5836630Z scale_ub_tensor = None 2025-05-07T20:31:48.5836701Z 2025-05-07T20:31:48.5836839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5836934Z op = silu_mul_quant 2025-05-07T20:31:48.5837022Z if compiled: 
2025-05-07T20:31:48.5837121Z op = torch.compile(op) 2025-05-07T20:31:48.5837227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5837302Z 2025-05-07T20:31:48.5837394Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5837517Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5837593Z 2025-05-07T20:31:48.5837731Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5837837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5837943Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5838066Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5838212Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5838289Z 2025-05-07T20:31:48.5838388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5838393Z 2025-05-07T20:31:48.5838491Z moe/activation_test.py:126: 2025-05-07T20:31:48.5838712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5838818Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5838957Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5839532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5839633Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5840007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5840234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5840616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5840875Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5841286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5841544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5841925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5842099Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5842448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5842524Z fn() 2025-05-07T20:31:48.5843037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5843118Z self.fn.run( 2025-05-07T20:31:48.5843462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5843564Z kernel = self.compile( 2025-05-07T20:31:48.5843953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5844131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5844256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5844260Z 2025-05-07T20:31:48.5844469Z self = 2025-05-07T20:31:48.5845268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:48.5845785Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc516e52a70>} 2025-05-07T20:31:48.5846582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5846801Z context = 2025-05-07T20:31:48.5846806Z 2025-05-07T20:31:48.5846973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5847245Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5847351Z module_map=module_map) 2025-05-07T20:31:48.5847520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5847623Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5847705Z E ^ 2025-05-07T20:31:48.5848072Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5848184Z 2025-05-07T20:31:48.5848608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5848613Z 2025-05-07T20:31:48.5848723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5848950Z self=, 2025-05-07T20:31:48.5849026Z T=128, 2025-05-07T20:31:48.5849109Z D=5120, 2025-05-07T20:31:48.5849191Z scale_ub=None, 2025-05-07T20:31:48.5849273Z contiguous=True, 2025-05-07T20:31:48.5849360Z compiled=True, 2025-05-07T20:31:48.5849435Z ) 2025-05-07T20:31:48.5849657Z self = 2025-05-07T20:31:48.5849837Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5849842Z 2025-05-07T20:31:48.5849914Z @given( 2025-05-07T20:31:48.5850040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5850144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5850260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5850381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5850497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5850569Z ) 2025-05-07T20:31:48.5850823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5850920Z def test_silu_mul_quant( 2025-05-07T20:31:48.5850996Z self, 2025-05-07T20:31:48.5851074Z T: int, 2025-05-07T20:31:48.5851149Z D: int, 2025-05-07T20:31:48.5851247Z scale_ub: Optional[float], 2025-05-07T20:31:48.5851340Z contiguous: bool, 2025-05-07T20:31:48.5851426Z compiled: bool, 2025-05-07T20:31:48.5851591Z ) -> None: 2025-05-07T20:31:48.5851689Z torch.manual_seed(2025) 2025-05-07T20:31:48.5851760Z 2025-05-07T20:31:48.5851936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5852012Z 2025-05-07T20:31:48.5852104Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5852235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5852325Z x = x_sign * x_clamp 2025-05-07T20:31:48.5852405Z x0 = x[:, :D] 2025-05-07T20:31:48.5852486Z x1 = x[:, D:] 2025-05-07T20:31:48.5852559Z 2025-05-07T20:31:48.5852641Z if contiguous: 2025-05-07T20:31:48.5852736Z x0 = x0.contiguous() 2025-05-07T20:31:48.5852825Z x1 = x1.contiguous() 2025-05-07T20:31:48.5852896Z 2025-05-07T20:31:48.5852990Z if scale_ub is not None: 2025-05-07T20:31:48.5853097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5853239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5853319Z ) 2025-05-07T20:31:48.5853393Z else: 2025-05-07T20:31:48.5853490Z scale_ub_tensor = None 2025-05-07T20:31:48.5853565Z 2025-05-07T20:31:48.5853695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:48.5853790Z op = silu_mul_quant 2025-05-07T20:31:48.5853874Z if compiled: 2025-05-07T20:31:48.5853973Z op = torch.compile(op) 2025-05-07T20:31:48.5854087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5854158Z 2025-05-07T20:31:48.5854249Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5854376Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5854447Z 2025-05-07T20:31:48.5854590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5854691Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5854790Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5854922Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5855062Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5855133Z 2025-05-07T20:31:48.5855323Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5855327Z 2025-05-07T20:31:48.5855424Z moe/activation_test.py:126: 2025-05-07T20:31:48.5855774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5855937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5856112Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5856698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5856811Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5857198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5857431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5857803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5858124Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5858527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5858779Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5859159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5859326Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5859677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5859755Z fn() 2025-05-07T20:31:48.5860300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5860391Z self.fn.run( 2025-05-07T20:31:48.5860739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5860833Z kernel = self.compile( 2025-05-07T20:31:48.5861222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5861399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5861527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5861532Z 2025-05-07T20:31:48.5861739Z self = 2025-05-07T20:31:48.5862530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5863044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc514dd11b0>} 2025-05-07T20:31:48.5863803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5863998Z context = 2025-05-07T20:31:48.5864003Z 2025-05-07T20:31:48.5864169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5864435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5864550Z module_map=module_map) 2025-05-07T20:31:48.5864720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5864827Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5864903Z E ^ 2025-05-07T20:31:48.5865380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5865385Z 2025-05-07T20:31:48.5865808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5865812Z 2025-05-07T20:31:48.5865916Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5866144Z self=, 2025-05-07T20:31:48.5866222Z T=4096, 2025-05-07T20:31:48.5866299Z D=5120, 2025-05-07T20:31:48.5866388Z scale_ub=None, 2025-05-07T20:31:48.5866475Z contiguous=True, 2025-05-07T20:31:48.5866558Z compiled=True, 2025-05-07T20:31:48.5866637Z ) 2025-05-07T20:31:48.5866862Z self = 2025-05-07T20:31:48.5867033Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5867038Z 2025-05-07T20:31:48.5867122Z @given( 2025-05-07T20:31:48.5867241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5867345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5867462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5867581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5867701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5867775Z ) 2025-05-07T20:31:48.5868024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5868123Z def test_silu_mul_quant( 2025-05-07T20:31:48.5868198Z self, 2025-05-07T20:31:48.5868274Z T: int, 2025-05-07T20:31:48.5868357Z D: int, 2025-05-07T20:31:48.5868456Z scale_ub: Optional[float], 2025-05-07T20:31:48.5868630Z contiguous: bool, 2025-05-07T20:31:48.5868721Z compiled: bool, 2025-05-07T20:31:48.5868801Z ) -> None: 2025-05-07T20:31:48.5868901Z torch.manual_seed(2025) 2025-05-07T20:31:48.5868982Z 2025-05-07T20:31:48.5869151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5869228Z 2025-05-07T20:31:48.5869323Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5869449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5869546Z x = x_sign * x_clamp 2025-05-07T20:31:48.5869625Z x0 = x[:, :D] 2025-05-07T20:31:48.5869705Z x1 = x[:, D:] 2025-05-07T20:31:48.5869779Z 2025-05-07T20:31:48.5869862Z if contiguous: 2025-05-07T20:31:48.5869955Z x0 = x0.contiguous() 2025-05-07T20:31:48.5870048Z x1 = x1.contiguous() 2025-05-07T20:31:48.5870125Z 2025-05-07T20:31:48.5870214Z if scale_ub is not None: 2025-05-07T20:31:48.5870330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5870468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.5870546Z ) 2025-05-07T20:31:48.5870626Z else: 2025-05-07T20:31:48.5870720Z scale_ub_tensor 
= None 2025-05-07T20:31:48.5870796Z 2025-05-07T20:31:48.5870926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5871017Z op = silu_mul_quant 2025-05-07T20:31:48.5871104Z if compiled: 2025-05-07T20:31:48.5871204Z op = torch.compile(op) 2025-05-07T20:31:48.5871311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5871387Z 2025-05-07T20:31:48.5871478Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5871600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5871677Z 2025-05-07T20:31:48.5871814Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5871918Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5872021Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5872144Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5872373Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5872449Z 2025-05-07T20:31:48.5872548Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5872553Z 2025-05-07T20:31:48.5872652Z moe/activation_test.py:126: 2025-05-07T20:31:48.5872779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5872890Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5873026Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5873588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5873696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5874064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5874288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5874668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5874923Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5875332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5875586Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5875965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5876138Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5876691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5876775Z fn() 2025-05-07T20:31:48.5877178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5877268Z self.fn.run( 2025-05-07T20:31:48.5877611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5877706Z kernel = self.compile( 2025-05-07T20:31:48.5878089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5878267Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5878393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5878398Z 2025-05-07T20:31:48.5878607Z self = 2025-05-07T20:31:48.5879397Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.5879911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc516e51d80>} 2025-05-07T20:31:48.5880672Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.5880868Z context = 2025-05-07T20:31:48.5880872Z 2025-05-07T20:31:48.5881042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.5881308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.5881423Z module_map=module_map) 2025-05-07T20:31:48.5881591Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.5881775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:48.5881854Z E ^ 2025-05-07T20:31:48.5882213Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.5882218Z 2025-05-07T20:31:48.5882638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.5882643Z 2025-05-07T20:31:48.5882754Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5882977Z self=, 2025-05-07T20:31:48.5883060Z T=16384, 2025-05-07T20:31:48.5883136Z D=5120, 2025-05-07T20:31:48.5883217Z scale_ub=None, 2025-05-07T20:31:48.5883306Z contiguous=True, 2025-05-07T20:31:48.5883393Z compiled=True, 2025-05-07T20:31:48.5883468Z ) 2025-05-07T20:31:48.5883692Z self = 2025-05-07T20:31:48.5883871Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5883876Z 2025-05-07T20:31:48.5883952Z @given( 2025-05-07T20:31:48.5884076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5884177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5884296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5884414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5884528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5884606Z ) 2025-05-07T20:31:48.5884854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5884950Z def test_silu_mul_quant( 2025-05-07T20:31:48.5885033Z self, 2025-05-07T20:31:48.5885110Z T: int, 2025-05-07T20:31:48.5885268Z D: int, 2025-05-07T20:31:48.5885375Z scale_ub: Optional[float], 2025-05-07T20:31:48.5885465Z contiguous: bool, 2025-05-07T20:31:48.5885563Z compiled: bool, 2025-05-07T20:31:48.5885642Z ) -> None: 2025-05-07T20:31:48.5885738Z torch.manual_seed(2025) 2025-05-07T20:31:48.5885816Z 2025-05-07T20:31:48.5885986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5886061Z 2025-05-07T20:31:48.5886162Z x_sign = torch.sign(x) 2025-05-07T20:31:48.5886286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.5886380Z x = x_sign * x_clamp 2025-05-07T20:31:48.5886477Z x0 = x[:, :D] 2025-05-07T20:31:48.5886567Z x1 = x[:, D:] 2025-05-07T20:31:48.5886656Z 2025-05-07T20:31:48.5886753Z if contiguous: 2025-05-07T20:31:48.5886845Z x0 = x0.contiguous() 2025-05-07T20:31:48.5886936Z x1 = x1.contiguous() 2025-05-07T20:31:48.5887009Z 2025-05-07T20:31:48.5887108Z if scale_ub is not None: 2025-05-07T20:31:48.5887217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.5887354Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:48.5887432Z ) 2025-05-07T20:31:48.5887515Z else: 2025-05-07T20:31:48.5887609Z scale_ub_tensor = None 2025-05-07T20:31:48.5887682Z 2025-05-07T20:31:48.5887818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5887909Z op = silu_mul_quant 2025-05-07T20:31:48.5887999Z if compiled: 2025-05-07T20:31:48.5888101Z op = torch.compile(op) 2025-05-07T20:31:48.5888210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.5888290Z 2025-05-07T20:31:48.5888382Z y_fp8, y_scale = fn() 2025-05-07T20:31:48.5888504Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:48.5888579Z 2025-05-07T20:31:48.5888719Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.5888820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:48.5888923Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:48.5889127Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:48.5889272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5889345Z 2025-05-07T20:31:48.5889445Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:48.5889450Z 2025-05-07T20:31:48.5889551Z moe/activation_test.py:126: 2025-05-07T20:31:48.5889677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.5889783Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:48.5889921Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:48.5890485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:48.5890598Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:48.5890962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.5891190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.5891565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:48.5891820Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5892221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:48.5892475Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:48.5892852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:48.5893103Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:48.5893450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:48.5893532Z fn() 2025-05-07T20:31:48.5893940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:48.5894022Z self.fn.run( 2025-05-07T20:31:48.5894367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.5894462Z kernel = self.compile( 2025-05-07T20:31:48.5894845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.5895026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.5895149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fc50fdb36d0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
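The reference path makes the contract explicit: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row must return a row-quantized tensor plus a per-row dequantization scale. A minimal pure-PyTorch sketch of that contract, assuming the e4m3 finite maximum of 448 and assuming scale_ub caps the per-row max (a sketch of the semantics, not fbgemm's kernel):

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max in fp32; scale_ub (assumed semantics) caps outliers.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
    y_scale = row_max / FP8_E4M3_MAX           # dequantization scale, shape [T]
    y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(torch.float8_e4m3fn)
    # Round-trip, as checked by the test: y_fp8.to(torch.float32) * y_scale[:, None] ≈ y
    return y_fp8, y_scale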
Hypothesis retries ten more examples, all failing identically: each reprints the same test body and ends in the same CompilationError at compiler.py:100, differing only in the drawn parameters and in which kernel first reaches the Triton compiler:

Trying example: T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (via ref_fn, moe/activation_test.py:126)
Trying example: T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant (via fn, moe/activation_test.py:117)
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn)
Trying example: T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True   -> _kernel_quantize_fp8_row (via ref_fn)
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True   -> _fbgemm_silu_mul_quant (via fn; the log cuts off mid-error here)

Every one of them terminates in:

E       triton.compiler.errors.CompilationError: at 1:0:
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6053345Z 2025-05-07T20:31:48.6053766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6053770Z 2025-05-07T20:31:48.6053872Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6054104Z self=, 2025-05-07T20:31:48.6054180Z T=1, 2025-05-07T20:31:48.6054261Z D=5120, 2025-05-07T20:31:48.6054342Z scale_ub=1200.0, 2025-05-07T20:31:48.6054426Z contiguous=False, 2025-05-07T20:31:48.6054510Z compiled=False, 2025-05-07T20:31:48.6054580Z ) 2025-05-07T20:31:48.6054800Z self = 2025-05-07T20:31:48.6054972Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6054976Z 2025-05-07T20:31:48.6055049Z @given( 2025-05-07T20:31:48.6055166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6055265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6055382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6055500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6055859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6055969Z ) 2025-05-07T20:31:48.6056244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6056561Z def test_silu_mul_quant( 2025-05-07T20:31:48.6056647Z self, 2025-05-07T20:31:48.6056742Z T: int, 2025-05-07T20:31:48.6056815Z D: int, 2025-05-07T20:31:48.6056911Z scale_ub: Optional[float], 2025-05-07T20:31:48.6057003Z contiguous: bool, 2025-05-07T20:31:48.6057086Z compiled: bool, 2025-05-07T20:31:48.6057165Z ) -> None: 2025-05-07T20:31:48.6057264Z torch.manual_seed(2025) 2025-05-07T20:31:48.6057335Z 2025-05-07T20:31:48.6057509Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6057581Z 2025-05-07T20:31:48.6057676Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6057804Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6057891Z x = x_sign * x_clamp 2025-05-07T20:31:48.6057972Z x0 = x[:, :D] 2025-05-07T20:31:48.6058103Z x1 = x[:, D:] 2025-05-07T20:31:48.6058176Z 2025-05-07T20:31:48.6058258Z if contiguous: 2025-05-07T20:31:48.6058358Z x0 = x0.contiguous() 2025-05-07T20:31:48.6058444Z x1 = x1.contiguous() 2025-05-07T20:31:48.6058516Z 2025-05-07T20:31:48.6058608Z if scale_ub is not None: 2025-05-07T20:31:48.6058712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6058847Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6058923Z ) 2025-05-07T20:31:48.6058997Z else: 2025-05-07T20:31:48.6059094Z scale_ub_tensor = None 2025-05-07T20:31:48.6059166Z 2025-05-07T20:31:48.6059296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6059387Z op = silu_mul_quant 2025-05-07T20:31:48.6059469Z if compiled: 2025-05-07T20:31:48.6059566Z op = torch.compile(op) 2025-05-07T20:31:48.6059792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6059866Z 2025-05-07T20:31:48.6059955Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6059965Z 2025-05-07T20:31:48.6060068Z moe/activation_test.py:117: 2025-05-07T20:31:48.6060194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6060300Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6060399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6060910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6061009Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6061374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6061597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6061948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6062040Z kernel = self.compile( 2025-05-07T20:31:48.6062436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6062611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6062734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6062739Z 2025-05-07T20:31:48.6062948Z self = 2025-05-07T20:31:48.6063740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6064262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca050>} 2025-05-07T20:31:48.6065025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6065326Z context = 2025-05-07T20:31:48.6065334Z 2025-05-07T20:31:48.6065499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6065767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6065875Z module_map=module_map) 2025-05-07T20:31:48.6066035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6066133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6066210Z E ^ 2025-05-07T20:31:48.6066573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6066577Z 2025-05-07T20:31:48.6067051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6067061Z 2025-05-07T20:31:48.6067162Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6067386Z self=, 2025-05-07T20:31:48.6067467Z T=16384, 2025-05-07T20:31:48.6067542Z D=5120, 2025-05-07T20:31:48.6067623Z scale_ub=1200.0, 2025-05-07T20:31:48.6067710Z contiguous=False, 2025-05-07T20:31:48.6067790Z compiled=True, 2025-05-07T20:31:48.6067860Z ) 2025-05-07T20:31:48.6068083Z self = 2025-05-07T20:31:48.6068261Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6068265Z 2025-05-07T20:31:48.6068343Z @given( 2025-05-07T20:31:48.6068540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6068639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6068756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6068878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6068993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6069066Z ) 2025-05-07T20:31:48.6069314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6069409Z def test_silu_mul_quant( 2025-05-07T20:31:48.6069482Z self, 2025-05-07T20:31:48.6069556Z T: int, 2025-05-07T20:31:48.6069634Z D: int, 2025-05-07T20:31:48.6069730Z scale_ub: Optional[float], 2025-05-07T20:31:48.6069817Z contiguous: bool, 2025-05-07T20:31:48.6069902Z compiled: bool, 2025-05-07T20:31:48.6069983Z ) -> None: 2025-05-07T20:31:48.6070078Z torch.manual_seed(2025) 2025-05-07T20:31:48.6070152Z 2025-05-07T20:31:48.6070326Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6070402Z 2025-05-07T20:31:48.6070494Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6070622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6070709Z x = x_sign * x_clamp 2025-05-07T20:31:48.6070789Z x0 = x[:, :D] 2025-05-07T20:31:48.6070866Z x1 = x[:, D:] 2025-05-07T20:31:48.6070943Z 2025-05-07T20:31:48.6071026Z if contiguous: 2025-05-07T20:31:48.6071114Z x0 = x0.contiguous() 2025-05-07T20:31:48.6071204Z x1 = x1.contiguous() 2025-05-07T20:31:48.6075453Z 2025-05-07T20:31:48.6075560Z if scale_ub is not None: 2025-05-07T20:31:48.6075677Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6075840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6075914Z ) 2025-05-07T20:31:48.6076006Z else: 2025-05-07T20:31:48.6076111Z scale_ub_tensor = None 2025-05-07T20:31:48.6076179Z 2025-05-07T20:31:48.6076314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6076507Z op = silu_mul_quant 2025-05-07T20:31:48.6076589Z if compiled: 2025-05-07T20:31:48.6076691Z op = torch.compile(op) 2025-05-07T20:31:48.6076796Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6076870Z 2025-05-07T20:31:48.6076959Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6076964Z 2025-05-07T20:31:48.6077062Z moe/activation_test.py:117: 2025-05-07T20:31:48.6077196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6077299Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6077399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6077782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6077877Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6078385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6078491Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6078857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6079086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6079434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6079526Z kernel = self.compile( 2025-05-07T20:31:48.6079918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6080094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6080300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6080305Z 2025-05-07T20:31:48.6080519Z self = 2025-05-07T20:31:48.6081323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6081840Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fec83a0>} 2025-05-07T20:31:48.6082605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6082802Z context = 2025-05-07T20:31:48.6082807Z 2025-05-07T20:31:48.6082978Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6083247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6083362Z module_map=module_map) 2025-05-07T20:31:48.6083527Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6083629Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6083704Z E ^ 2025-05-07T20:31:48.6084065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6084070Z 2025-05-07T20:31:48.6084497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6084502Z 2025-05-07T20:31:48.6084605Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6084833Z self=, 2025-05-07T20:31:48.6084912Z T=2048, 2025-05-07T20:31:48.6084985Z D=7168, 2025-05-07T20:31:48.6085070Z scale_ub=1200.0, 2025-05-07T20:31:48.6085155Z contiguous=False, 2025-05-07T20:31:48.6085319Z compiled=True, 2025-05-07T20:31:48.6085395Z ) 2025-05-07T20:31:48.6085618Z self = 2025-05-07T20:31:48.6085794Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6085799Z 2025-05-07T20:31:48.6085875Z @given( 2025-05-07T20:31:48.6085994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6086094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6086209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6086326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6086461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6086537Z ) 2025-05-07T20:31:48.6086816Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6086915Z def test_silu_mul_quant( 2025-05-07T20:31:48.6086987Z self, 2025-05-07T20:31:48.6087065Z T: int, 2025-05-07T20:31:48.6087144Z D: int, 2025-05-07T20:31:48.6087242Z scale_ub: Optional[float], 2025-05-07T20:31:48.6087330Z contiguous: bool, 2025-05-07T20:31:48.6087418Z compiled: bool, 2025-05-07T20:31:48.6087495Z ) -> None: 2025-05-07T20:31:48.6087594Z torch.manual_seed(2025) 2025-05-07T20:31:48.6087664Z 2025-05-07T20:31:48.6087836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6087911Z 2025-05-07T20:31:48.6088004Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6088131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6088225Z x = x_sign * x_clamp 2025-05-07T20:31:48.6088302Z x0 = x[:, :D] 2025-05-07T20:31:48.6088379Z x1 = x[:, D:] 2025-05-07T20:31:48.6088452Z 2025-05-07T20:31:48.6088614Z if contiguous: 2025-05-07T20:31:48.6088706Z x0 = x0.contiguous() 2025-05-07T20:31:48.6088797Z x1 = x1.contiguous() 2025-05-07T20:31:48.6088872Z 2025-05-07T20:31:48.6088969Z if scale_ub is not None: 2025-05-07T20:31:48.6089076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6089212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6089288Z ) 2025-05-07T20:31:48.6089362Z else: 2025-05-07T20:31:48.6089455Z scale_ub_tensor = None 2025-05-07T20:31:48.6089527Z 2025-05-07T20:31:48.6089659Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6089748Z op = silu_mul_quant 2025-05-07T20:31:48.6089833Z if compiled: 2025-05-07T20:31:48.6089936Z op = torch.compile(op) 2025-05-07T20:31:48.6090041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6090113Z 2025-05-07T20:31:48.6090207Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6090211Z 2025-05-07T20:31:48.6090311Z moe/activation_test.py:117: 2025-05-07T20:31:48.6090438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6090542Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6090647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6091024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6091116Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6091624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6091721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6092088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6092317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6092663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6092842Z kernel = self.compile( 2025-05-07T20:31:48.6093229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6093408Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6093536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6093541Z 2025-05-07T20:31:48.6093749Z self = 2025-05-07T20:31:48.6094546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6095063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50feca200>} 2025-05-07T20:31:48.6095834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6096031Z context = 2025-05-07T20:31:48.6096036Z 2025-05-07T20:31:48.6096201Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6096472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6096587Z module_map=module_map) 2025-05-07T20:31:48.6096776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6096892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6096966Z E ^ 2025-05-07T20:31:48.6097404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6097419Z 2025-05-07T20:31:48.6097844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6097848Z 2025-05-07T20:31:48.6097951Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6098253Z self=, 2025-05-07T20:31:48.6098328Z T=1, 2025-05-07T20:31:48.6098401Z D=5120, 2025-05-07T20:31:48.6098485Z scale_ub=None, 2025-05-07T20:31:48.6098569Z contiguous=False, 2025-05-07T20:31:48.6098651Z compiled=False, 2025-05-07T20:31:48.6098723Z ) 2025-05-07T20:31:48.6098943Z self = 2025-05-07T20:31:48.6099116Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.6099120Z 2025-05-07T20:31:48.6099201Z @given( 2025-05-07T20:31:48.6099320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6099421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6099541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6099658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6099775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6099847Z ) 2025-05-07T20:31:48.6100096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6100191Z def test_silu_mul_quant( 2025-05-07T20:31:48.6100265Z self, 2025-05-07T20:31:48.6100341Z T: int, 2025-05-07T20:31:48.6100414Z D: int, 2025-05-07T20:31:48.6100512Z scale_ub: Optional[float], 2025-05-07T20:31:48.6100603Z contiguous: bool, 2025-05-07T20:31:48.6100687Z compiled: bool, 2025-05-07T20:31:48.6100765Z ) -> None: 2025-05-07T20:31:48.6100865Z torch.manual_seed(2025) 2025-05-07T20:31:48.6100936Z 2025-05-07T20:31:48.6101108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6101296Z 2025-05-07T20:31:48.6101387Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6101511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6101599Z x = x_sign * x_clamp 2025-05-07T20:31:48.6101676Z x0 = x[:, :D] 2025-05-07T20:31:48.6101757Z x1 = x[:, D:] 2025-05-07T20:31:48.6101827Z 2025-05-07T20:31:48.6101908Z if contiguous: 2025-05-07T20:31:48.6102000Z x0 = x0.contiguous() 2025-05-07T20:31:48.6102088Z x1 = x1.contiguous() 2025-05-07T20:31:48.6102159Z 2025-05-07T20:31:48.6102252Z if scale_ub is not None: 2025-05-07T20:31:48.6102356Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6102492Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6102567Z ) 2025-05-07T20:31:48.6102645Z else: 2025-05-07T20:31:48.6102742Z scale_ub_tensor = None 2025-05-07T20:31:48.6102817Z 2025-05-07T20:31:48.6102948Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6103044Z op = silu_mul_quant 2025-05-07T20:31:48.6103127Z if compiled: 2025-05-07T20:31:48.6103225Z op = torch.compile(op) 2025-05-07T20:31:48.6103334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6103403Z 2025-05-07T20:31:48.6103491Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6103496Z 2025-05-07T20:31:48.6103594Z moe/activation_test.py:117: 2025-05-07T20:31:48.6103722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6103822Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6103923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6104512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6104613Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6104978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6105206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6105556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6105649Z kernel = self.compile( 2025-05-07T20:31:48.6106039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6106217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6106342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6106346Z 2025-05-07T20:31:48.6106562Z self = 2025-05-07T20:31:48.6107354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6107878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50fecb490>} 2025-05-07T20:31:48.6108641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6108836Z context = 2025-05-07T20:31:48.6108840Z 2025-05-07T20:31:48.6109008Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6109282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6109391Z module_map=module_map) 2025-05-07T20:31:48.6109635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6109732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6109810Z E ^ 2025-05-07T20:31:48.6110171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6110175Z 2025-05-07T20:31:48.6110597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6110607Z 2025-05-07T20:31:48.6110709Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6110934Z self=, 2025-05-07T20:31:48.6111011Z T=4096, 2025-05-07T20:31:48.6111083Z D=7168, 2025-05-07T20:31:48.6111163Z scale_ub=1200.0, 2025-05-07T20:31:48.6111256Z contiguous=False, 2025-05-07T20:31:48.6111339Z compiled=False, 2025-05-07T20:31:48.6111409Z ) 2025-05-07T20:31:48.6111633Z self = 2025-05-07T20:31:48.6111814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6111818Z 2025-05-07T20:31:48.6111892Z @given( 2025-05-07T20:31:48.6112018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6112115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6112232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6112350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6112464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6112537Z ) 2025-05-07T20:31:48.6112787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6112880Z def test_silu_mul_quant( 2025-05-07T20:31:48.6113034Z self, 2025-05-07T20:31:48.6113110Z T: int, 2025-05-07T20:31:48.6113183Z D: int, 2025-05-07T20:31:48.6113283Z scale_ub: Optional[float], 2025-05-07T20:31:48.6113377Z contiguous: bool, 2025-05-07T20:31:48.6113463Z compiled: bool, 2025-05-07T20:31:48.6113539Z ) -> None: 2025-05-07T20:31:48.6113632Z torch.manual_seed(2025) 2025-05-07T20:31:48.6113705Z 2025-05-07T20:31:48.6113880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6113951Z 2025-05-07T20:31:48.6114044Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6114169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6114254Z x = x_sign * x_clamp 2025-05-07T20:31:48.6114336Z x0 = x[:, :D] 2025-05-07T20:31:48.6114414Z x1 = x[:, D:] 2025-05-07T20:31:48.6114484Z 2025-05-07T20:31:48.6114567Z if contiguous: 2025-05-07T20:31:48.6114656Z x0 = x0.contiguous() 2025-05-07T20:31:48.6114750Z x1 = x1.contiguous() 2025-05-07T20:31:48.6114824Z 2025-05-07T20:31:48.6114914Z if scale_ub is not None: 2025-05-07T20:31:48.6115021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6115160Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6115231Z ) 2025-05-07T20:31:48.6115307Z else: 2025-05-07T20:31:48.6115398Z scale_ub_tensor = None 2025-05-07T20:31:48.6115469Z 2025-05-07T20:31:48.6115604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6115690Z op = silu_mul_quant 2025-05-07T20:31:48.6115772Z if compiled: 2025-05-07T20:31:48.6115873Z op = torch.compile(op) 2025-05-07T20:31:48.6115978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6116047Z 2025-05-07T20:31:48.6116140Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6116144Z 2025-05-07T20:31:48.6116240Z moe/activation_test.py:117: 2025-05-07T20:31:48.6116374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6116475Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6116680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6117221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6117316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6117682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6117907Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6118253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6118346Z kernel = self.compile( 2025-05-07T20:31:48.6118739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6118917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6119043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6119053Z 2025-05-07T20:31:48.6119259Z self = 2025-05-07T20:31:48.6120053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6120565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61c550>} 2025-05-07T20:31:48.6121404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6121605Z context = 2025-05-07T20:31:48.6121614Z 2025-05-07T20:31:48.6121781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6122051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6122157Z module_map=module_map) 2025-05-07T20:31:48.6122318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6122419Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6122492Z E ^ 2025-05-07T20:31:48.6122857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6122861Z 2025-05-07T20:31:48.6123282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6123291Z 2025-05-07T20:31:48.6123395Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6123625Z self=, 2025-05-07T20:31:48.6123704Z T=16384, 2025-05-07T20:31:48.6123778Z D=7168, 2025-05-07T20:31:48.6123860Z scale_ub=None, 2025-05-07T20:31:48.6123942Z contiguous=True, 2025-05-07T20:31:48.6124026Z compiled=True, 2025-05-07T20:31:48.6124097Z ) 2025-05-07T20:31:48.6124317Z self = 2025-05-07T20:31:48.6124494Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.6124498Z 2025-05-07T20:31:48.6124571Z @given( 2025-05-07T20:31:48.6124690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6124791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6124906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6125029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6125145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6125215Z ) 2025-05-07T20:31:48.6125550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6125643Z def test_silu_mul_quant( 2025-05-07T20:31:48.6125717Z self, 2025-05-07T20:31:48.6125794Z T: int, 2025-05-07T20:31:48.6125867Z D: int, 2025-05-07T20:31:48.6125964Z scale_ub: Optional[float], 2025-05-07T20:31:48.6126055Z contiguous: bool, 2025-05-07T20:31:48.6126137Z compiled: bool, 2025-05-07T20:31:48.6126215Z ) -> None: 2025-05-07T20:31:48.6126312Z torch.manual_seed(2025) 2025-05-07T20:31:48.6126383Z 2025-05-07T20:31:48.6126556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6126629Z 2025-05-07T20:31:48.6126730Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6126880Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6126988Z x = x_sign * x_clamp 2025-05-07T20:31:48.6127069Z x0 = x[:, :D] 2025-05-07T20:31:48.6127151Z x1 = x[:, D:] 2025-05-07T20:31:48.6127227Z 2025-05-07T20:31:48.6127310Z if contiguous: 2025-05-07T20:31:48.6127402Z x0 = x0.contiguous() 2025-05-07T20:31:48.6127490Z x1 = x1.contiguous() 2025-05-07T20:31:48.6127560Z 2025-05-07T20:31:48.6127653Z if scale_ub is not None: 2025-05-07T20:31:48.6127758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6127893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6127968Z ) 2025-05-07T20:31:48.6128040Z else: 2025-05-07T20:31:48.6128133Z scale_ub_tensor = None 2025-05-07T20:31:48.6128206Z 2025-05-07T20:31:48.6128335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6128426Z op = silu_mul_quant 2025-05-07T20:31:48.6128510Z if compiled: 2025-05-07T20:31:48.6128713Z op = torch.compile(op) 2025-05-07T20:31:48.6128823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6128896Z 2025-05-07T20:31:48.6128986Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6128991Z 2025-05-07T20:31:48.6129091Z moe/activation_test.py:117: 2025-05-07T20:31:48.6129219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6129320Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6129422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6129797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6129895Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6130398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6130493Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6130863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6131086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6131437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6131527Z kernel = self.compile( 2025-05-07T20:31:48.6131916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6132093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6132217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6132221Z 2025-05-07T20:31:48.6132428Z self = 2025-05-07T20:31:48.6133227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6133820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61d360>} 2025-05-07T20:31:48.6134586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6134779Z context = 2025-05-07T20:31:48.6134784Z 2025-05-07T20:31:48.6134950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6135218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6135328Z module_map=module_map) 2025-05-07T20:31:48.6135498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6135596Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6135674Z E ^ 2025-05-07T20:31:48.6136035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6136040Z 2025-05-07T20:31:48.6136463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6136468Z 2025-05-07T20:31:48.6136571Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6136798Z self=, 2025-05-07T20:31:48.6136875Z T=4096, 2025-05-07T20:31:48.6136950Z D=5120, 2025-05-07T20:31:48.6137029Z scale_ub=None, 2025-05-07T20:31:48.6137118Z contiguous=False, 2025-05-07T20:31:48.6137199Z compiled=True, 2025-05-07T20:31:48.6137269Z ) 2025-05-07T20:31:48.6137568Z self = 2025-05-07T20:31:48.6137743Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:48.6137753Z 2025-05-07T20:31:48.6137828Z @given( 2025-05-07T20:31:48.6137945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6138096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6138215Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6138333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6138446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6138521Z ) 2025-05-07T20:31:48.6138769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6138862Z def test_silu_mul_quant( 2025-05-07T20:31:48.6138939Z self, 2025-05-07T20:31:48.6139014Z T: int, 2025-05-07T20:31:48.6139088Z D: int, 2025-05-07T20:31:48.6139193Z scale_ub: Optional[float], 2025-05-07T20:31:48.6139282Z contiguous: bool, 2025-05-07T20:31:48.6139374Z compiled: bool, 2025-05-07T20:31:48.6139450Z ) -> None: 2025-05-07T20:31:48.6139547Z torch.manual_seed(2025) 2025-05-07T20:31:48.6139621Z 2025-05-07T20:31:48.6139793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6139865Z 2025-05-07T20:31:48.6139961Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6140084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6140171Z x = x_sign * x_clamp 2025-05-07T20:31:48.6140251Z x0 = x[:, :D] 2025-05-07T20:31:48.6140327Z x1 = x[:, D:] 2025-05-07T20:31:48.6140398Z 2025-05-07T20:31:48.6140483Z if contiguous: 2025-05-07T20:31:48.6140572Z x0 = x0.contiguous() 2025-05-07T20:31:48.6140662Z x1 = x1.contiguous() 2025-05-07T20:31:48.6140732Z 2025-05-07T20:31:48.6140821Z if scale_ub is not None: 2025-05-07T20:31:48.6140932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6141066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6141282Z ) 2025-05-07T20:31:48.6141361Z else: 2025-05-07T20:31:48.6141455Z scale_ub_tensor = None 2025-05-07T20:31:48.6141525Z 2025-05-07T20:31:48.6141657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6141746Z op = silu_mul_quant 2025-05-07T20:31:48.6141828Z if compiled: 2025-05-07T20:31:48.6141929Z op = torch.compile(op) 2025-05-07T20:31:48.6142033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6142102Z 2025-05-07T20:31:48.6142196Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6142201Z 2025-05-07T20:31:48.6142296Z moe/activation_test.py:117: 2025-05-07T20:31:48.6142425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6142528Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6142628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6143007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6143106Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6143610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6143709Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6144072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6144301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6144647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6144739Z kernel = self.compile( 2025-05-07T20:31:48.6145209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6145387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6145519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6145524Z 2025-05-07T20:31:48.6145732Z self = 2025-05-07T20:31:48.6146574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6147091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61dea0>} 2025-05-07T20:31:48.6147856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6148052Z context = 2025-05-07T20:31:48.6148061Z 2025-05-07T20:31:48.6148227Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6148494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6148603Z module_map=module_map) 2025-05-07T20:31:48.6148766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6148864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6148938Z E ^ 2025-05-07T20:31:48.6149297Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6149302Z 2025-05-07T20:31:48.6149729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6149733Z 2025-05-07T20:31:48.6149837Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6150143Z self=, 2025-05-07T20:31:48.6150218Z T=4096, 2025-05-07T20:31:48.6150290Z D=5120, 2025-05-07T20:31:48.6150373Z scale_ub=1200.0, 2025-05-07T20:31:48.6150457Z contiguous=False, 2025-05-07T20:31:48.6150538Z compiled=False, 2025-05-07T20:31:48.6150613Z ) 2025-05-07T20:31:48.6150833Z self = 2025-05-07T20:31:48.6151008Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6151013Z 2025-05-07T20:31:48.6151090Z @given( 2025-05-07T20:31:48.6151210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6151311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6151429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6151546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6151664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6151739Z ) 2025-05-07T20:31:48.6151987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6152081Z def test_silu_mul_quant( 2025-05-07T20:31:48.6152154Z self, 2025-05-07T20:31:48.6152228Z T: int, 2025-05-07T20:31:48.6152305Z D: int, 2025-05-07T20:31:48.6152402Z scale_ub: Optional[float], 2025-05-07T20:31:48.6152489Z contiguous: bool, 2025-05-07T20:31:48.6152576Z compiled: bool, 2025-05-07T20:31:48.6152653Z ) -> None: 2025-05-07T20:31:48.6152749Z torch.manual_seed(2025) 2025-05-07T20:31:48.6152819Z 2025-05-07T20:31:48.6152990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6153067Z 2025-05-07T20:31:48.6153155Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6153358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6153448Z x = x_sign * x_clamp 2025-05-07T20:31:48.6153530Z x0 = x[:, :D] 2025-05-07T20:31:48.6153607Z x1 = x[:, D:] 2025-05-07T20:31:48.6153680Z 2025-05-07T20:31:48.6153761Z if contiguous: 2025-05-07T20:31:48.6153852Z x0 = x0.contiguous() 2025-05-07T20:31:48.6153941Z x1 = x1.contiguous() 2025-05-07T20:31:48.6154014Z 2025-05-07T20:31:48.6154104Z if scale_ub is not None: 2025-05-07T20:31:48.6154210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6154344Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6154420Z ) 2025-05-07T20:31:48.6154494Z else: 2025-05-07T20:31:48.6154585Z scale_ub_tensor = None 2025-05-07T20:31:48.6154658Z 2025-05-07T20:31:48.6154788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6154881Z op = silu_mul_quant 2025-05-07T20:31:48.6154966Z if compiled: 2025-05-07T20:31:48.6155063Z op = torch.compile(op) 2025-05-07T20:31:48.6155173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6155245Z 2025-05-07T20:31:48.6155334Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6155338Z 2025-05-07T20:31:48.6155438Z moe/activation_test.py:117: 2025-05-07T20:31:48.6155768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6155921Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6156054Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6156592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6156701Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6157083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6157312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6157665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6157920Z kernel = self.compile( 2025-05-07T20:31:48.6158312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6158491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6158613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6158618Z 2025-05-07T20:31:48.6158823Z self = 2025-05-07T20:31:48.6159623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6160138Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61e680>} 2025-05-07T20:31:48.6160909Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6161104Z context = 2025-05-07T20:31:48.6161108Z 2025-05-07T20:31:48.6161275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6161542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6161647Z module_map=module_map) 2025-05-07T20:31:48.6161811Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6162043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6162121Z E ^ 2025-05-07T20:31:48.6162484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6162495Z 2025-05-07T20:31:48.6162924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6162930Z 2025-05-07T20:31:48.6163037Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6163264Z self=, 2025-05-07T20:31:48.6163340Z T=4096, 2025-05-07T20:31:48.6163417Z D=5120, 2025-05-07T20:31:48.6163497Z scale_ub=1200.0, 2025-05-07T20:31:48.6163582Z contiguous=False, 2025-05-07T20:31:48.6163666Z compiled=True, 2025-05-07T20:31:48.6163737Z ) 2025-05-07T20:31:48.6163960Z self = 2025-05-07T20:31:48.6164137Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:48.6164142Z 2025-05-07T20:31:48.6164216Z @given( 2025-05-07T20:31:48.6164337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6164438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6164554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6164674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6164787Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6164861Z ) 2025-05-07T20:31:48.6165108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6165202Z def test_silu_mul_quant( 2025-05-07T20:31:48.6165279Z self, 2025-05-07T20:31:48.6165354Z T: int, 2025-05-07T20:31:48.6165434Z D: int, 2025-05-07T20:31:48.6165534Z scale_ub: Optional[float], 2025-05-07T20:31:48.6165620Z contiguous: bool, 2025-05-07T20:31:48.6165715Z compiled: bool, 2025-05-07T20:31:48.6165812Z ) -> None: 2025-05-07T20:31:48.6165915Z torch.manual_seed(2025) 2025-05-07T20:31:48.6166001Z 2025-05-07T20:31:48.6166258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6166329Z 2025-05-07T20:31:48.6166419Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6166545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6166631Z x = x_sign * x_clamp 2025-05-07T20:31:48.6166712Z x0 = x[:, :D] 2025-05-07T20:31:48.6166789Z x1 = x[:, D:] 2025-05-07T20:31:48.6166858Z 2025-05-07T20:31:48.6166944Z if contiguous: 2025-05-07T20:31:48.6167033Z x0 = x0.contiguous() 2025-05-07T20:31:48.6167119Z x1 = x1.contiguous() 2025-05-07T20:31:48.6167193Z 2025-05-07T20:31:48.6167283Z if scale_ub is not None: 2025-05-07T20:31:48.6167388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6167531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6167603Z ) 2025-05-07T20:31:48.6167677Z else: 2025-05-07T20:31:48.6167771Z scale_ub_tensor = None 2025-05-07T20:31:48.6167846Z 2025-05-07T20:31:48.6167979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6168069Z op = silu_mul_quant 2025-05-07T20:31:48.6168152Z if compiled: 2025-05-07T20:31:48.6168251Z op = torch.compile(op) 2025-05-07T20:31:48.6168356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6168425Z 2025-05-07T20:31:48.6168515Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6168519Z 2025-05-07T20:31:48.6168615Z moe/activation_test.py:117: 2025-05-07T20:31:48.6168740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6168844Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6168942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6169398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6169494Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6170003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6170103Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6170465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6170687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6171034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6171125Z kernel = self.compile( 2025-05-07T20:31:48.6171515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6171694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6171819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6171828Z 2025-05-07T20:31:48.6172039Z self = 2025-05-07T20:31:48.6172829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6173340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50f61fac0>} 2025-05-07T20:31:48.6174099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6174295Z context = 2025-05-07T20:31:48.6174303Z 2025-05-07T20:31:48.6174469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6174816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6174926Z module_map=module_map) 2025-05-07T20:31:48.6175088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6175184Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6175260Z E ^ 2025-05-07T20:31:48.6175618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6175623Z 2025-05-07T20:31:48.6176046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6176051Z 2025-05-07T20:31:48.6176158Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6176407Z self=, 2025-05-07T20:31:48.6176498Z T=2048, 2025-05-07T20:31:48.6176590Z D=7168, 2025-05-07T20:31:48.6176675Z scale_ub=1200.0, 2025-05-07T20:31:48.6176761Z contiguous=False, 2025-05-07T20:31:48.6176844Z compiled=False, 2025-05-07T20:31:48.6176917Z ) 2025-05-07T20:31:48.6177142Z self = 2025-05-07T20:31:48.6177316Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.6177320Z 2025-05-07T20:31:48.6177399Z @given( 2025-05-07T20:31:48.6177516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6177613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6177731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6177847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6178095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6178172Z ) 2025-05-07T20:31:48.6178419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6178521Z def test_silu_mul_quant( 2025-05-07T20:31:48.6178595Z self, 2025-05-07T20:31:48.6178669Z T: int, 2025-05-07T20:31:48.6178744Z D: int, 2025-05-07T20:31:48.6178840Z scale_ub: Optional[float], 2025-05-07T20:31:48.6178927Z contiguous: bool, 2025-05-07T20:31:48.6179012Z compiled: bool, 2025-05-07T20:31:48.6179089Z ) -> None: 2025-05-07T20:31:48.6179182Z torch.manual_seed(2025) 2025-05-07T20:31:48.6179257Z 2025-05-07T20:31:48.6179426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6179497Z 2025-05-07T20:31:48.6179590Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6179714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6179804Z x = x_sign * x_clamp 2025-05-07T20:31:48.6179888Z x0 = x[:, :D] 2025-05-07T20:31:48.6179964Z x1 = x[:, D:] 2025-05-07T20:31:48.6180040Z 2025-05-07T20:31:48.6180125Z if contiguous: 2025-05-07T20:31:48.6180216Z x0 = x0.contiguous() 2025-05-07T20:31:48.6180303Z x1 = x1.contiguous() 2025-05-07T20:31:48.6180374Z 2025-05-07T20:31:48.6180463Z if scale_ub is not None: 2025-05-07T20:31:48.6180571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6180704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6180777Z ) 2025-05-07T20:31:48.6180855Z else: 2025-05-07T20:31:48.6180947Z scale_ub_tensor = None 2025-05-07T20:31:48.6181017Z 2025-05-07T20:31:48.6181149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6181238Z op = silu_mul_quant 2025-05-07T20:31:48.6181326Z if compiled: 2025-05-07T20:31:48.6181428Z op = torch.compile(op) 2025-05-07T20:31:48.6181534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6181607Z 2025-05-07T20:31:48.6181694Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6181781Z 2025-05-07T20:31:48.6181878Z moe/activation_test.py:117: 2025-05-07T20:31:48.6182008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6182108Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6182207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6182718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:48.6182814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6183186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6183409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6183759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6183854Z kernel = self.compile( 2025-05-07T20:31:48.6184250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6184425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6184549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6184554Z 2025-05-07T20:31:48.6184758Z self = 2025-05-07T20:31:48.6185550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6186135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc5144ce200>} 2025-05-07T20:31:48.6186952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6187150Z context = 2025-05-07T20:31:48.6187155Z 2025-05-07T20:31:48.6187321Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6187588Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6187695Z module_map=module_map) 2025-05-07T20:31:48.6187857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6187954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6188027Z E ^ 2025-05-07T20:31:48.6188392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:48.6188818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[Hypothesis tried ten further examples, each failing with the identical source listing, traceback, and CompilationError as the example above; only the tried parameters are kept below.]
2025-05-07T20:31:48.6188933Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:48.6205937Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6219314Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6232045Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6245271Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6258917Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:48.6271801Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6285165Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6298573Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:48.6311891Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
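[Analysis, not part of the runner output.] The root cause is architectural, not numerical: "fp8e4nv" is Triton's name for the float8 E4M3 format (torch.float8_e4m3fn), which Triton only supports on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper). This job's linux.g5.4xlarge.nvidia.gpu runner carries an A10G (sm_86), where Triton offers only 'fp8e4b15' and 'fp8e5', so every Hypothesis example dies at kernel-compile time inside make_ir before any tensors are touched. A minimal sketch of a device-capability gate that would skip these cases on pre-sm_89 GPUs follows; the helper name and the skip placement are hypothetical and not part of activation_test.py:

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (E4M3) codegen requires an
        # NVIDIA GPU with compute capability >= 8.9; the A10G here is sm_86.
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns a (major, minor) tuple.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(
        not cuda_supports_fp8e4nv(),
        "fp8e4nv (float8_e4m3fn) is unsupported on this GPU architecture",
    )
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would then run only on sm_89+ devices

Gating at the class (or module) level would also keep Hypothesis from spending its max_examples budget rediscovering the same compile-time failure, which is what inflates this log.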
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6329303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
[The next eight Hypothesis examples fail identically: each raises triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant (moe/activation_test.py:117 -> triton/compiler/compiler.py:100) with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the test source and traceback repeat the listing above verbatim.]
2025-05-07T20:31:48.6329411Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6342691Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6356050Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6369993Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:48.6382786Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:48.6395446Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6408646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.6421896Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
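All of the fp8e4nv failures above trace to the same hardware constraint: Triton compiles the fp8e4nv dtype (the e4m3 format behind torch.float8_e4m3fn) only for NVIDIA GPUs of compute capability 8.9 or newer; older architectures expose just fp8e4b15 and fp8e5, the exact pair listed in the ValueError. A minimal sketch of a capability guard that would skip these examples rather than fail them; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test code:

import unittest

import torch

def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper: fp8e4nv (float8 e4m3) kernels compile only on
    # SM 8.9+ (Ada/Hopper); anything older hits the CompilationError above.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

@unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant and friends as defined in moe/activation_test.py

Guarded this way, Hypothesis would record a single skip instead of re-deriving the identical CompilationError for every drawn (T, D, scale_ub, contiguous, compiled) combination.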
2025-05-07T20:31:48.6435036Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6438493Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6440454Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6440585Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6440692Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6444150Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6446008Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6446147Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6446257Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:48.6454027Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:48.6456312Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6456453Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:31:48.6456562Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:48.6460236Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.6462200Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6462335Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:48.6462442Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:48.6465794Z > x_sign = torch.sign(x)
2025-05-07T20:31:48.6467633Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:48.6467766Z moe/activation_test.py:94: OutOfMemoryError
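The OutOfMemoryError examples above are a capacity problem rather than a kernel bug: one input x of shape [T, 2 * D] in bfloat16 at T=16384, D=7168 is 16384 × 14336 × 2 bytes = 448 MiB, exactly the allocation that fails, and by this point earlier examples have left nearly all of the 22.07 GiB card in use. A short sketch of the mitigation the error text itself suggests, plus an explicit cleanup between examples; the helper is an illustrative assumption, not part of the test file:

import gc
import os

# The allocator hint from the error message; it must be set before the
# process makes its first CUDA allocation to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_cache() -> None:
    # Hypothetical per-example cleanup: drop dead Python references, then
    # return cached allocator blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()

Calling release_cuda_cache() from a teardown hook can keep a 448 MiB draw from failing when memory is held only by dead references or the caching allocator; it cannot help if live tensors from the current example genuinely exceed the card.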
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
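These fp8e4nv failures are an architecture limit, not a flake: Triton rejects the fp8e4nv (float8_e4m3fn) dtype at kernel-compile time on this GPU and offers only fp8e4b15 and fp8e5. A hedged guard sketch for skipping such cases; the (8, 9) capability threshold is an assumption (fp8e4nv codegen generally targets Ada/Hopper-class parts), and the class name is illustrative:

    # Sketch: skip FP8 cases on GPUs whose Triton backend rejects fp8e4nv.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold

    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...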
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above: activation_test.py:115 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> triton jit.py:330/623 -> compiler.py:273 make_ir)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.70 GiB allocated by PyTorch, 53.93 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
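All of the Triton compilation failures above originate in silu_mul_quant before the kernel ever runs. For orientation, a plain-PyTorch sketch of the semantics the name and call signature suggest (SiLU(x0) * x1, then FP8 quantization with an optional scale upper bound); this is an assumption for illustration, not the FBGEMM kernel's actual definition:

    # Sketch of presumed semantics; the real op launches the Triton kernel
    # _fbgemm_silu_mul_quant seen in the traceback above.
    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = F.silu(x0.float()) * x1.float()      # gated SiLU activation in fp32
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float().squeeze())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = amax / fp8_max                 # tensor-wise scale (assumed)
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale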
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

self = <...>, T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:94: OutOfMemoryError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)

self = <...>, T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
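A quick cross-check: the first allocation in the test is the [T, 2 * D] bfloat16 input, i.e. T * 2D elements at 2 bytes each, and every "Tried to allocate" size in this log matches that tensor exactly:

    # Cross-check: requested bytes = T * (2 * D) elements * 2 bytes (bfloat16).
    def alloc_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    assert alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert alloc_mib(2048, 5120) == 40.0    # 40.00 MiB
    assert alloc_mib(4096, 5120) == 80.0    # 80.00 MiB
    assert alloc_mib(4096, 7168) == 112.0   # 112.00 MiB
    assert alloc_mib(16384, 5120) == 320.0  # 320.00 MiB
    assert alloc_mib(16384, 7168) == 448.0  # 448.00 MiB

So no single request is oversized; with only ~30 MiB free on a 22.07 GiB device, even the smallest example's input cannot be placed, which points at memory retained across earlier examples rather than at any one shape.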
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. (30.44 MiB free; 21.73 GiB allocated by PyTorch; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (30.44 MiB free; full report as above)
moe/activation_test.py:92: OutOfMemoryError
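Hypothesis replays all examples in one process, so once the device fills up every later draw fails on entry (the free figure stays pinned at 30.44 MiB through the rest of the run). A mitigation sketch, as a local test change this log does not contain, releasing cached blocks between examples:

    # Sketch: release cached CUDA blocks between Hypothesis examples.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # let in-flight kernels finish first
        torch.cuda.empty_cache()  # return cached blocks to the driver

Calling this at the top of test_silu_mul_quant, or from a fixture wrapping each example, keeps one example's tensors from starving the next.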
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

self = <...>, T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    [@given/@settings and test body as in the listing above]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: (traceback identical to the one above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
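One variant below differs: with compiled=True the traceback passes through torch/_dynamo/eval_frame.py, yet ends in the same Triton error. torch.compile only wraps the Python callable; the FBGEMM op JIT-compiles its Triton kernel at first launch either way, so compiled and eager examples fail identically here. A minimal illustration (the import path is the one shown in the tracebacks; the wrapper is illustrative):

    # Sketch: compiled and eager calls reach the same Triton JIT compile.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    def fn(x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor):
        return silu_mul_quant(x0, x1, scale_ub)

    compiled_fn = torch.compile(fn)  # only adds _dynamo frames to the traceback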
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

self = <...>, T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [@given/@settings and test body as in the listing above]
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0: 30.44 MiB free of 22.07 GiB; 21.74 GiB allocated by PyTorch, 5.24 MiB reserved but unallocated. (allocator hint and doc link as above)

moe/activation_test.py:92: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6605575Z 2025-05-07T20:31:48.6605695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6605776Z 2025-05-07T20:31:48.6605879Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6606105Z self=, 2025-05-07T20:31:48.6606187Z T=128, 2025-05-07T20:31:48.6606260Z D=7168, 2025-05-07T20:31:48.6606339Z scale_ub=1200.0, 2025-05-07T20:31:48.6606434Z contiguous=True, 2025-05-07T20:31:48.6606529Z compiled=True, 2025-05-07T20:31:48.6606601Z ) 2025-05-07T20:31:48.6606844Z self = 2025-05-07T20:31:48.6607011Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.6607015Z 2025-05-07T20:31:48.6607091Z @given( 2025-05-07T20:31:48.6607208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6607304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6607420Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6607539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6607653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6607726Z ) 2025-05-07T20:31:48.6607974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6608072Z def test_silu_mul_quant( 2025-05-07T20:31:48.6608146Z self, 2025-05-07T20:31:48.6608220Z T: int, 2025-05-07T20:31:48.6608293Z D: int, 2025-05-07T20:31:48.6608394Z scale_ub: Optional[float], 2025-05-07T20:31:48.6608482Z contiguous: bool, 2025-05-07T20:31:48.6608568Z compiled: bool, 2025-05-07T20:31:48.6608643Z ) -> None: 2025-05-07T20:31:48.6608735Z torch.manual_seed(2025) 2025-05-07T20:31:48.6608808Z 2025-05-07T20:31:48.6608977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6609049Z 2025-05-07T20:31:48.6609143Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6609272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6609358Z x = x_sign * x_clamp 2025-05-07T20:31:48.6609437Z x0 = x[:, :D] 2025-05-07T20:31:48.6609514Z x1 = x[:, D:] 2025-05-07T20:31:48.6609690Z 2025-05-07T20:31:48.6609773Z if contiguous: 2025-05-07T20:31:48.6609865Z x0 = x0.contiguous() 2025-05-07T20:31:48.6609955Z x1 = x1.contiguous() 2025-05-07T20:31:48.6610026Z 2025-05-07T20:31:48.6610116Z if scale_ub is not None: 2025-05-07T20:31:48.6610222Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.6610358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.6610428Z ) 2025-05-07T20:31:48.6610506Z else: 2025-05-07T20:31:48.6610597Z scale_ub_tensor = None 2025-05-07T20:31:48.6610665Z 2025-05-07T20:31:48.6610800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.6610887Z op = silu_mul_quant 2025-05-07T20:31:48.6610969Z if compiled: 2025-05-07T20:31:48.6611074Z op = torch.compile(op) 2025-05-07T20:31:48.6611179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6611260Z 2025-05-07T20:31:48.6611349Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.6611353Z 2025-05-07T20:31:48.6611450Z moe/activation_test.py:117: 2025-05-07T20:31:48.6611584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6611683Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.6611782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.6612162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:48.6612253Z return fn(*args, **kwargs) 
2025-05-07T20:31:48.6612758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.6612858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.6613302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.6613530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.6613881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.6613975Z kernel = self.compile( 2025-05-07T20:31:48.6614372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.6614549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.6614674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.6614678Z 2025-05-07T20:31:48.6614887Z self = 2025-05-07T20:31:48.6615688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.6616208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc50e6cf0a0>} 2025-05-07T20:31:48.6616971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.6617168Z context = 2025-05-07T20:31:48.6617173Z 2025-05-07T20:31:48.6617338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.6617608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.6617722Z module_map=module_map) 2025-05-07T20:31:48.6617887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.6617988Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.6618200Z E ^ 2025-05-07T20:31:48.6618564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.6618568Z 2025-05-07T20:31:48.6618993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.6618998Z 2025-05-07T20:31:48.6619101Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6619331Z self=, 2025-05-07T20:31:48.6619405Z T=128, 2025-05-07T20:31:48.6619479Z D=7168, 2025-05-07T20:31:48.6619562Z scale_ub=1200.0, 2025-05-07T20:31:48.6619646Z contiguous=True, 2025-05-07T20:31:48.6619727Z compiled=False, 2025-05-07T20:31:48.6619804Z ) 2025-05-07T20:31:48.6620034Z self = 2025-05-07T20:31:48.6620205Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6620215Z 2025-05-07T20:31:48.6620291Z @given( 2025-05-07T20:31:48.6620410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6620513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6620626Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6620747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6620865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6620936Z ) 2025-05-07T20:31:48.6621186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6621283Z def test_silu_mul_quant( 2025-05-07T20:31:48.6621359Z self, 2025-05-07T20:31:48.6621435Z T: int, 2025-05-07T20:31:48.6621512Z D: int, 2025-05-07T20:31:48.6621687Z scale_ub: Optional[float], 2025-05-07T20:31:48.6621776Z contiguous: bool, 2025-05-07T20:31:48.6621863Z compiled: bool, 2025-05-07T20:31:48.6621939Z ) -> None: 2025-05-07T20:31:48.6622041Z torch.manual_seed(2025) 2025-05-07T20:31:48.6622111Z 2025-05-07T20:31:48.6622282Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6622359Z 2025-05-07T20:31:48.6622449Z x_sign = torch.sign(x) 2025-05-07T20:31:48.6622575Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.6624422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6624428Z 2025-05-07T20:31:48.6624549Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:48.6624553Z 2025-05-07T20:31:48.6624659Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6624885Z self=, 2025-05-07T20:31:48.6624956Z T=128, 2025-05-07T20:31:48.6625032Z D=5120, 2025-05-07T20:31:48.6625112Z scale_ub=1200.0, 2025-05-07T20:31:48.6625197Z contiguous=True, 2025-05-07T20:31:48.6625277Z compiled=True, 2025-05-07T20:31:48.6625346Z ) 2025-05-07T20:31:48.6625566Z self = 2025-05-07T20:31:48.6625733Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:48.6625737Z 2025-05-07T20:31:48.6625809Z @given( 2025-05-07T20:31:48.6625936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6626032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6626145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6626353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6626485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6626568Z ) 2025-05-07T20:31:48.6626830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6626924Z def test_silu_mul_quant( 2025-05-07T20:31:48.6627000Z self, 2025-05-07T20:31:48.6627071Z T: int, 2025-05-07T20:31:48.6627145Z D: int, 2025-05-07T20:31:48.6627243Z scale_ub: Optional[float], 2025-05-07T20:31:48.6627330Z contiguous: bool, 2025-05-07T20:31:48.6627413Z compiled: bool, 2025-05-07T20:31:48.6627490Z ) -> None: 2025-05-07T20:31:48.6627583Z torch.manual_seed(2025) 2025-05-07T20:31:48.6627652Z 2025-05-07T20:31:48.6627828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6627899Z 2025-05-07T20:31:48.6627991Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.6629829Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6629834Z 2025-05-07T20:31:48.6629958Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.6629962Z 2025-05-07T20:31:48.6630064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6630373Z self=, 2025-05-07T20:31:48.6630453Z T=128, 2025-05-07T20:31:48.6630524Z D=7168, 2025-05-07T20:31:48.6630605Z scale_ub=None, 2025-05-07T20:31:48.6630689Z contiguous=True, 2025-05-07T20:31:48.6630769Z compiled=True, 2025-05-07T20:31:48.6630838Z ) 2025-05-07T20:31:48.6631059Z self = 2025-05-07T20:31:48.6631224Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.6631229Z 2025-05-07T20:31:48.6631305Z @given( 2025-05-07T20:31:48.6631422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6631517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6631638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6631755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6631866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6631941Z ) 2025-05-07T20:31:48.6632191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6632283Z def test_silu_mul_quant( 2025-05-07T20:31:48.6632366Z self, 2025-05-07T20:31:48.6632439Z T: int, 2025-05-07T20:31:48.6632516Z D: int, 2025-05-07T20:31:48.6632611Z scale_ub: Optional[float], 2025-05-07T20:31:48.6632696Z contiguous: bool, 2025-05-07T20:31:48.6632784Z compiled: bool, 2025-05-07T20:31:48.6632859Z ) -> None: 2025-05-07T20:31:48.6632952Z torch.manual_seed(2025) 2025-05-07T20:31:48.6633026Z 2025-05-07T20:31:48.6633193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6635024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6635111Z 2025-05-07T20:31:48.6635229Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6635360Z =============================== warnings summary =============================== 2025-05-07T20:31:48.6635677Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6635983Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6636292Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:48.6637246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:48.6637481Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:48.6637489Z 2025-05-07T20:31:48.6637667Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:48.6638967Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:48.6639159Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:48.6639164Z 2025-05-07T20:31:48.6639471Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:48.6639639Z ================== 1 failed, 1 passed, 13 warnings in 29.66s =================== 2025-05-07T20:31:50.3886825Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:50.4503895Z 2025-05-07T20:31:50.4504467Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:50.4504842Z 2025-05-07T20:31:50.4504846Z 2025-05-07T20:31:50.4525339Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:52.6080002Z ============================= test session starts ============================== 2025-05-07T20:31:52.6080676Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:52.6081216Z cachedir: .pytest_cache 2025-05-07T20:31:52.6081810Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:52.6082556Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:52.6082977Z plugins: hypothesis-6.131.14 2025-05-07T20:31:54.2148914Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:54.3936282Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:54.3936696Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:54.3936928Z 2025-05-07T20:31:56.5185695Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5186841Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:56.5188218Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5190137Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5191554Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5192974Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5194308Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.5195717Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5197166Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.5198489Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:56.5199886Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.5201130Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:56.5202181Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:56.5203220Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:56.5204465Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.5205781Z W0507 20:31:56.517000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.5206924Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:56.5208005Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:56.5209244Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.5210628Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.5211713Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5212638Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5213482Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:56.5214526Z W0507 20:31:56.517000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.5354689Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.5355975Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:56.5357366Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.5358900Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.5360310Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.5361725Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.5363305Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.5364723Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.5366166Z W0507 20:31:56.534000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] [... remaining frames identical to the traceback above, elided ...] 2025-05-07T20:31:56.5380672Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.5381604Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.5382357Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:56.5383390Z W0507 20:31:56.534000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1021847Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1022564Z self=, 2025-05-07T20:31:57.1022986Z T=1, 2025-05-07T20:31:57.1023192Z D=5120, 2025-05-07T20:31:57.1023401Z scale_ub=None, 2025-05-07T20:31:57.1023625Z contiguous=True, 2025-05-07T20:31:57.1024257Z compiled=True, 2025-05-07T20:31:57.1024485Z ) 2025-05-07T20:31:57.1024821Z self = 2025-05-07T20:31:57.1025340Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:57.1025616Z 2025-05-07T20:31:57.1025707Z @given( 2025-05-07T20:31:57.1025950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.1026282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.1026603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.1026950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.1027287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.1027588Z ) 2025-05-07T20:31:57.1027955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.1028413Z def test_silu_mul_quant( 2025-05-07T20:31:57.1028670Z self, 2025-05-07T20:31:57.1028903Z T: int, 2025-05-07T20:31:57.1029132Z D: int, 2025-05-07T20:31:57.1029364Z scale_ub: Optional[float], 2025-05-07T20:31:57.1029650Z contiguous: bool, 2025-05-07T20:31:57.1029904Z compiled: bool, 2025-05-07T20:31:57.1030149Z ) -> None: 2025-05-07T20:31:57.1030377Z torch.manual_seed(2025) 2025-05-07T20:31:57.1030625Z 2025-05-07T20:31:57.1030913Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.1031271Z 2025-05-07T20:31:57.1031483Z x_sign = torch.sign(x) 2025-05-07T20:31:57.1031788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.1032111Z x = x_sign * x_clamp 2025-05-07T20:31:57.1032359Z x0 = x[:, :D] 2025-05-07T20:31:57.1032580Z x1 = x[:, D:] 2025-05-07T20:31:57.1032800Z 2025-05-07T20:31:57.1032998Z if contiguous: 2025-05-07T20:31:57.1033238Z x0 = x0.contiguous() 2025-05-07T20:31:57.1033509Z x1 = x1.contiguous() 2025-05-07T20:31:57.1033766Z 2025-05-07T20:31:57.1033964Z if scale_ub is not None: 2025-05-07T20:31:57.1034251Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.1034767Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.1035082Z ) 2025-05-07T20:31:57.1035288Z else: 2025-05-07T20:31:57.1035510Z scale_ub_tensor = None 2025-05-07T20:31:57.1035770Z 2025-05-07T20:31:57.1036020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.1036347Z op = silu_mul_quant 2025-05-07T20:31:57.1036607Z if compiled: 2025-05-07T20:31:57.1036858Z op = torch.compile(op) 2025-05-07T20:31:57.1037169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.1037457Z 2025-05-07T20:31:57.1037653Z y_fp8, y_scale = fn() 2025-05-07T20:31:57.1037957Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:57.1038257Z 2025-05-07T20:31:57.1038507Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.1038856Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:57.1039161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:57.1039489Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:57.1039859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.1040182Z 2025-05-07T20:31:57.1040388Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:57.1040597Z 2025-05-07T20:31:57.1040703Z moe/activation_test.py:126: 2025-05-07T20:31:57.1041009Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.1041354Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:57.1041690Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:57.1042501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:57.1043360Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:57.1043925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.1044627Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.1045335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:57.1046076Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.1046841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:57.1047606Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:57.1048354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:57.1049065Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:57.1049674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:57.1050212Z fn() 2025-05-07T20:31:57.1050736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:57.1051329Z self.fn.run( 2025-05-07T20:31:57.1051804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.1052352Z kernel = self.compile( 2025-05-07T20:31:57.1052910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.1053574Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.1053979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.1054214Z 2025-05-07T20:31:57.1054436Z self = 2025-05-07T20:31:57.1056103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.1058380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc8283400>} 2025-05-07T20:31:57.1060121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.1061172Z context = 2025-05-07T20:31:57.1061475Z 2025-05-07T20:31:57.1061646Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.1062188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.1062663Z module_map=module_map) 2025-05-07T20:31:57.1063045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.1063417Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:57.1063692Z E ^ 2025-05-07T20:31:57.1064164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.1064630Z 2025-05-07T20:31:57.1065054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.1065574Z 2025-05-07T20:31:57.1065686Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.1066103Z self=, 2025-05-07T20:31:57.1066510Z T=2048, 2025-05-07T20:31:57.1066703Z D=5120, 2025-05-07T20:31:57.1067030Z scale_ub=1200.0, 2025-05-07T20:31:57.1067256Z contiguous=True, 2025-05-07T20:31:57.1067497Z compiled=False, 2025-05-07T20:31:57.1067706Z ) 2025-05-07T20:31:58.0251254Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.0252391Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:58.0253772Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.0255263Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.0256900Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.0258381Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.0259771Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.0261182Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.0262629Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.0264254Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:58.0265499Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.0266735Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:58.0267796Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:58.0268830Z W0507 20:31:58.021000 87377 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] [... remaining frames identical to the traceback above, elided ...] 2025-05-07T20:31:58.0277396Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.0278327Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.0279125Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:58.0280175Z W0507 20:31:58.021000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.2317956Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.2319054Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:58.2320420Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.2321879Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.2323302Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.2324862Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.2326198Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.2327600Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.2329108Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.2330391Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:58.2331646Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.2332880Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:58.2333933Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:58.2335095Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:58.2336340Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.2337655Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.2338883Z W0507 
20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.2339993Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:58.2341203Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.2342598Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.2343684Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.2344620Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.2345371Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:58.2346416Z W0507 20:31:58.228000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0042402Z self = 2025-05-07T20:31:59.0043163Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:59.0043462Z 2025-05-07T20:31:59.0043548Z @given( 2025-05-07T20:31:59.0043803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0044129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0044447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0044795Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0045136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0045427Z ) 2025-05-07T20:31:59.0045793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0046253Z def test_silu_mul_quant( 2025-05-07T20:31:59.0046499Z self, 2025-05-07T20:31:59.0046706Z T: int, 2025-05-07T20:31:59.0046916Z D: int, 2025-05-07T20:31:59.0047139Z scale_ub: Optional[float], 2025-05-07T20:31:59.0047423Z contiguous: bool, 2025-05-07T20:31:59.0047678Z compiled: bool, 2025-05-07T20:31:59.0047909Z ) -> None: 2025-05-07T20:31:59.0048143Z torch.manual_seed(2025) 2025-05-07T20:31:59.0048392Z 2025-05-07T20:31:59.0048672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0049029Z 2025-05-07T20:31:59.0049234Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0049539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0049862Z x = x_sign * x_clamp 2025-05-07T20:31:59.0050111Z x0 = x[:, :D] 2025-05-07T20:31:59.0050338Z x1 = x[:, D:] 2025-05-07T20:31:59.0050549Z 2025-05-07T20:31:59.0050745Z if contiguous: 2025-05-07T20:31:59.0050988Z x0 = x0.contiguous() 2025-05-07T20:31:59.0051253Z x1 = x1.contiguous() 2025-05-07T20:31:59.0051622Z 2025-05-07T20:31:59.0051827Z if scale_ub is not None: 2025-05-07T20:31:59.0052109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0052470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0052789Z ) 2025-05-07T20:31:59.0052990Z else: 2025-05-07T20:31:59.0053212Z scale_ub_tensor = None 
2025-05-07T20:31:59.0053480Z 2025-05-07T20:31:59.0053720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0054047Z op = silu_mul_quant 2025-05-07T20:31:59.0054307Z if compiled: 2025-05-07T20:31:59.0054561Z op = torch.compile(op) 2025-05-07T20:31:59.0054911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0055191Z 2025-05-07T20:31:59.0055395Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.0055727Z 2025-05-07T20:31:59.0055843Z moe/activation_test.py:117: 2025-05-07T20:31:59.0056149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0056491Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.0056785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0057508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.0058273Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.0058829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0059582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0060257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0060806Z kernel = self.compile( 2025-05-07T20:31:59.0061366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0062043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0062447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0062813Z 2025-05-07T20:31:59.0063029Z self = 2025-05-07T20:31:59.0064139Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0065550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc3962e60>} 2025-05-07T20:31:59.0066930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0067976Z context = 2025-05-07T20:31:59.0068277Z 2025-05-07T20:31:59.0068453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0068990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0069521Z module_map=module_map) 2025-05-07T20:31:59.0069897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0070261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.0070530Z E ^ 2025-05-07T20:31:59.0071002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0071467Z 2025-05-07T20:31:59.0071891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.0072412Z 2025-05-07T20:31:59.0072669Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.0073101Z self=, 2025-05-07T20:31:59.0073512Z T=2048, 2025-05-07T20:31:59.0073709Z D=5120, 2025-05-07T20:31:59.0073911Z scale_ub=1200.0, 2025-05-07T20:31:59.0074135Z contiguous=True, 2025-05-07T20:31:59.0074362Z compiled=True, 2025-05-07T20:31:59.0074575Z ) 2025-05-07T20:31:59.0074901Z self = 2025-05-07T20:31:59.0075405Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.0075680Z 2025-05-07T20:31:59.0075765Z @given( 2025-05-07T20:31:59.0075998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0076321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0076638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0076982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0077322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0077616Z ) 2025-05-07T20:31:59.0077977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0078429Z def test_silu_mul_quant( 2025-05-07T20:31:59.0078678Z self, 2025-05-07T20:31:59.0078882Z T: int, 2025-05-07T20:31:59.0079080Z D: int, 2025-05-07T20:31:59.0079306Z scale_ub: Optional[float], 2025-05-07T20:31:59.0079585Z contiguous: bool, 2025-05-07T20:31:59.0079831Z compiled: bool, 2025-05-07T20:31:59.0080063Z ) -> None: 2025-05-07T20:31:59.0080285Z torch.manual_seed(2025) 2025-05-07T20:31:59.0080529Z 2025-05-07T20:31:59.0080810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0081161Z 2025-05-07T20:31:59.0081356Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0081659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0081981Z x = x_sign * x_clamp 2025-05-07T20:31:59.0082227Z x0 = x[:, :D] 2025-05-07T20:31:59.0082444Z x1 = x[:, D:] 2025-05-07T20:31:59.0082654Z 2025-05-07T20:31:59.0082934Z if contiguous: 2025-05-07T20:31:59.0083166Z x0 = x0.contiguous() 2025-05-07T20:31:59.0083430Z x1 = x1.contiguous() 2025-05-07T20:31:59.0083678Z 2025-05-07T20:31:59.0083870Z if scale_ub is not None: 2025-05-07T20:31:59.0084151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0084501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0084814Z ) 2025-05-07T20:31:59.0085012Z else: 2025-05-07T20:31:59.0085227Z scale_ub_tensor = None 2025-05-07T20:31:59.0085479Z 2025-05-07T20:31:59.0085718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0086037Z op = silu_mul_quant 2025-05-07T20:31:59.0086287Z if compiled: 2025-05-07T20:31:59.0086545Z op = torch.compile(op) 2025-05-07T20:31:59.0086849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0087127Z 2025-05-07T20:31:59.0087345Z y_fp8, y_scale = fn() 2025-05-07T20:31:59.0087638Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:59.0087935Z 2025-05-07T20:31:59.0088186Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0088527Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:59.0088828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:59.0089173Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:59.0089574Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0089888Z 2025-05-07T20:31:59.0090100Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:59.0090302Z 2025-05-07T20:31:59.0090411Z moe/activation_test.py:126: 2025-05-07T20:31:59.0090711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0091186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:59.0091530Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0092336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:59.0093113Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:59.0093672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0094370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0095070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:59.0095811Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0096587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:59.0097356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0098182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:59.0098836Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:59.0099502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:59.0100030Z fn() 2025-05-07T20:31:59.0100543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:59.0101134Z self.fn.run( 2025-05-07T20:31:59.0101616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0102152Z kernel = self.compile( 2025-05-07T20:31:59.0102715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0103385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0103875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0104103Z 2025-05-07T20:31:59.0104317Z self = 2025-05-07T20:31:59.0105423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0106829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d6c0>} 2025-05-07T20:31:59.0108207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0109247Z context = 2025-05-07T20:31:59.0109555Z 2025-05-07T20:31:59.0109724Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0110261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0110743Z module_map=module_map) 2025-05-07T20:31:59.0111110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0111476Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:59.0111757Z E ^ 2025-05-07T20:31:59.0112229Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0112689Z 2025-05-07T20:31:59.0113193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.0113724Z 2025-05-07T20:31:59.0113833Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.0114263Z self=, 2025-05-07T20:31:59.0114667Z T=16384, 2025-05-07T20:31:59.0114867Z D=7168, 2025-05-07T20:31:59.0115069Z scale_ub=1200.0, 2025-05-07T20:31:59.0115295Z contiguous=False, 2025-05-07T20:31:59.0115529Z compiled=False, 2025-05-07T20:31:59.0115744Z ) 2025-05-07T20:31:59.5706574Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5708778Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:59.5710542Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5712032Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5713460Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5714903Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5716267Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5717698Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5719345Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.5720633Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
2025-05-07T20:31:59.5721903Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5723165Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:59.5724249Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:31:59.5725315Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:59.5726580Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5727913Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5729198Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.5730348Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:59.5731566Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5732957Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5734046Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5734995Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5735758Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:59.5736808Z W0507 20:31:59.567000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.7277276Z W0507 20:31:59.724000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:00.7806872Z self = 
2025-05-07T20:32:00.7807421Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:00.7807720Z 
2025-05-07T20:32:00.7807806Z @given(
2025-05-07T20:32:00.7808089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.7808425Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.7808736Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.7809082Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.7809509Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.7810112Z )
2025-05-07T20:32:00.7810829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.7811746Z def test_silu_mul_quant(
2025-05-07T20:32:00.7812227Z self,
2025-05-07T20:32:00.7812618Z T: int,
2025-05-07T20:32:00.7813021Z D: int,
2025-05-07T20:32:00.7813455Z scale_ub: Optional[float],
2025-05-07T20:32:00.7814012Z contiguous: bool,
2025-05-07T20:32:00.7814493Z compiled: bool,
2025-05-07T20:32:00.7814948Z ) -> None:
2025-05-07T20:32:00.7815375Z torch.manual_seed(2025)
2025-05-07T20:32:00.7815867Z 
2025-05-07T20:32:00.7816427Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.7817113Z 
2025-05-07T20:32:00.7817507Z x_sign = torch.sign(x)
2025-05-07T20:32:00.7818186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.7819218Z x = x_sign * x_clamp
2025-05-07T20:32:00.7819607Z x0 = x[:, :D]
2025-05-07T20:32:00.7819868Z x1 = x[:, D:]
2025-05-07T20:32:00.7820075Z 
2025-05-07T20:32:00.7820273Z if contiguous:
2025-05-07T20:32:00.7820512Z x0 = x0.contiguous()
2025-05-07T20:32:00.7820778Z x1 = x1.contiguous()
2025-05-07T20:32:00.7821026Z 
2025-05-07T20:32:00.7821226Z if scale_ub is not None:
2025-05-07T20:32:00.7821502Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.7821848Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.7822163Z )
2025-05-07T20:32:00.7822362Z else:
2025-05-07T20:32:00.7822573Z scale_ub_tensor = None
2025-05-07T20:32:00.7822834Z 2025-05-07T20:32:00.7823074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7823391Z op = silu_mul_quant 2025-05-07T20:32:00.7823648Z if compiled: 2025-05-07T20:32:00.7823906Z op = torch.compile(op) 2025-05-07T20:32:00.7824210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7824491Z 2025-05-07T20:32:00.7824703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.7824871Z 2025-05-07T20:32:00.7824974Z moe/activation_test.py:117: 2025-05-07T20:32:00.7825277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7825614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.7825900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7826620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.7827332Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.7827891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7828589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7829283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7829835Z kernel = self.compile( 2025-05-07T20:32:00.7830530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7831200Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7831609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7831837Z 2025-05-07T20:32:00.7832061Z self = 2025-05-07T20:32:00.7833181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7834598Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc243d510>} 2025-05-07T20:32:00.7835989Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7837049Z context = 2025-05-07T20:32:00.7837344Z 2025-05-07T20:32:00.7837519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7838048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7838533Z module_map=module_map) 2025-05-07T20:32:00.7838910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7839274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.7839556Z E ^ 2025-05-07T20:32:00.7840143Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7840607Z 2025-05-07T20:32:00.7841047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7841576Z 2025-05-07T20:32:00.7841682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.7842119Z self=, 2025-05-07T20:32:00.7842535Z T=1, 2025-05-07T20:32:00.7850272Z D=7168, 2025-05-07T20:32:00.7850490Z scale_ub=None, 2025-05-07T20:32:00.7850718Z contiguous=True, 2025-05-07T20:32:00.7850947Z compiled=True, 2025-05-07T20:32:00.7851169Z ) 2025-05-07T20:32:00.7851510Z self = 2025-05-07T20:32:00.7852004Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:00.7852272Z 2025-05-07T20:32:00.7852357Z @given( 2025-05-07T20:32:00.7852598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.7852920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.7853234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.7853570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.7853904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.7854191Z ) 2025-05-07T20:32:00.7854553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.7855012Z def test_silu_mul_quant( 2025-05-07T20:32:00.7855254Z self, 2025-05-07T20:32:00.7855454Z T: int, 2025-05-07T20:32:00.7856011Z D: int, 2025-05-07T20:32:00.7856229Z scale_ub: Optional[float], 2025-05-07T20:32:00.7856508Z contiguous: bool, 2025-05-07T20:32:00.7856758Z compiled: bool, 2025-05-07T20:32:00.7856980Z ) -> None: 2025-05-07T20:32:00.7857206Z torch.manual_seed(2025) 2025-05-07T20:32:00.7857464Z 2025-05-07T20:32:00.7857750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.7858149Z 2025-05-07T20:32:00.7858532Z x_sign = torch.sign(x) 2025-05-07T20:32:00.7858834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.7859144Z x = x_sign * x_clamp 2025-05-07T20:32:00.7859386Z x0 = x[:, :D] 2025-05-07T20:32:00.7859610Z x1 = x[:, D:] 2025-05-07T20:32:00.7859818Z 2025-05-07T20:32:00.7860010Z if contiguous: 2025-05-07T20:32:00.7860253Z x0 = x0.contiguous() 2025-05-07T20:32:00.7860510Z x1 = x1.contiguous() 2025-05-07T20:32:00.7860754Z 2025-05-07T20:32:00.7860960Z if scale_ub is not None: 2025-05-07T20:32:00.7861234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.7861576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.7861889Z ) 2025-05-07T20:32:00.7862086Z else: 2025-05-07T20:32:00.7862304Z scale_ub_tensor = None 2025-05-07T20:32:00.7862559Z 2025-05-07T20:32:00.7862793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7863118Z op = silu_mul_quant 2025-05-07T20:32:00.7863372Z if compiled: 2025-05-07T20:32:00.7863624Z op = torch.compile(op) 2025-05-07T20:32:00.7863941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.7864218Z 2025-05-07T20:32:00.7864420Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.7864703Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.7864995Z 2025-05-07T20:32:00.7865242Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.7865583Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.7865874Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.7866190Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.7866672Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.7866988Z 2025-05-07T20:32:00.7867202Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:00.7867408Z 2025-05-07T20:32:00.7867510Z moe/activation_test.py:126: 2025-05-07T20:32:00.7867808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7868145Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.7868468Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.7869271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.7870088Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.7870646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.7871333Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.7872041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.7872774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.7873541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.7874299Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.7875036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.7875684Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.7876291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.7876810Z fn() 2025-05-07T20:32:00.7877332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.7877921Z self.fn.run( 2025-05-07T20:32:00.7878388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.7879008Z kernel = self.compile( 2025-05-07T20:32:00.7879560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.7880277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.7880670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.7880906Z 2025-05-07T20:32:00.7881117Z self = 2025-05-07T20:32:00.7882224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.7883627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d7e0>} 2025-05-07T20:32:00.7884999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.7886047Z context = 2025-05-07T20:32:00.7886346Z 2025-05-07T20:32:00.7886515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.7887051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.7887519Z module_map=module_map) 2025-05-07T20:32:00.7887901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.7888339Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.7888605Z E ^ 2025-05-07T20:32:00.7889074Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.7889540Z 2025-05-07T20:32:00.7889960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.7890478Z 2025-05-07T20:32:00.7890590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.7891008Z self=, 2025-05-07T20:32:00.7891414Z T=4096, 2025-05-07T20:32:00.7891607Z D=5120, 2025-05-07T20:32:00.7891805Z scale_ub=None, 2025-05-07T20:32:00.7892024Z contiguous=False, 2025-05-07T20:32:00.7892257Z compiled=False, 2025-05-07T20:32:00.7892466Z ) 2025-05-07T20:32:01.3800835Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:01.3803009Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:01.3805746Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:01.3808645Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:01.3810590Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:01.3812005Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3813501Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.3814910Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3816358Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:01.3817633Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 
2025-05-07T20:32:01.3818946Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:01.3820239Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:01.3821294Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:01.3822335Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:01.3823688Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:01.3824998Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:01.3826137Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:01.3827203Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:01.3828398Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:01.3829784Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:01.3830866Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3831804Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3832559Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:01.3833594Z W0507 20:32:01.376000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.9856690Z W0507 20:32:01.982000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:03.1577124Z self = 
2025-05-07T20:32:03.1577672Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:03.1578429Z 
2025-05-07T20:32:03.1578635Z @given(
2025-05-07T20:32:03.1578885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:03.1579227Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:03.1579603Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:03.1579957Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:03.1580356Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:03.1580654Z )
2025-05-07T20:32:03.1581027Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:03.1581498Z def test_silu_mul_quant(
2025-05-07T20:32:03.1581749Z self,
2025-05-07T20:32:03.1581957Z T: int,
2025-05-07T20:32:03.1582169Z D: int,
2025-05-07T20:32:03.1582396Z scale_ub: Optional[float],
2025-05-07T20:32:03.1582686Z contiguous: bool,
2025-05-07T20:32:03.1582941Z compiled: bool,
2025-05-07T20:32:03.1583176Z ) -> None:
2025-05-07T20:32:03.1583408Z torch.manual_seed(2025)
2025-05-07T20:32:03.1583851Z 
2025-05-07T20:32:03.1584137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:03.1584492Z 
2025-05-07T20:32:03.1584709Z x_sign = torch.sign(x)
2025-05-07T20:32:03.1585014Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:03.1585329Z x = x_sign * x_clamp
2025-05-07T20:32:03.1585581Z x0 = x[:, :D]
2025-05-07T20:32:03.1585808Z x1 = x[:, D:]
2025-05-07T20:32:03.1586018Z 
2025-05-07T20:32:03.1586216Z if contiguous:
2025-05-07T20:32:03.1586460Z x0 = x0.contiguous()
2025-05-07T20:32:03.1586722Z x1 = x1.contiguous()
2025-05-07T20:32:03.1586970Z 
2025-05-07T20:32:03.1587174Z if scale_ub is not None:
2025-05-07T20:32:03.1587453Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:03.1587807Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:03.1588126Z )
2025-05-07T20:32:03.1588328Z else:
2025-05-07T20:32:03.1588549Z scale_ub_tensor = None
2025-05-07T20:32:03.1588811Z 2025-05-07T20:32:03.1589055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1589379Z op = silu_mul_quant 2025-05-07T20:32:03.1589638Z if compiled: 2025-05-07T20:32:03.1589893Z op = torch.compile(op) 2025-05-07T20:32:03.1590198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1590525Z 2025-05-07T20:32:03.1590732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1590902Z 2025-05-07T20:32:03.1591007Z moe/activation_test.py:117: 2025-05-07T20:32:03.1591311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1591648Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1591944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1592668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1593389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1593949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1594778Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1595465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1596016Z kernel = self.compile( 2025-05-07T20:32:03.1596580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1597253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1597666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1597896Z 2025-05-07T20:32:03.1598119Z self = 2025-05-07T20:32:03.1599242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1600664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c5090>} 2025-05-07T20:32:03.1602048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1603107Z context = 2025-05-07T20:32:03.1603408Z 2025-05-07T20:32:03.1603588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1604204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1604696Z module_map=module_map) 2025-05-07T20:32:03.1605073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1605445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1605709Z E ^ 2025-05-07T20:32:03.1606190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1606653Z 2025-05-07T20:32:03.1607088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1607615Z 2025-05-07T20:32:03.1607724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1608156Z self=, 2025-05-07T20:32:03.1608568Z T=4096, 2025-05-07T20:32:03.1608770Z D=7168, 2025-05-07T20:32:03.1608964Z scale_ub=None, 2025-05-07T20:32:03.1609189Z contiguous=False, 2025-05-07T20:32:03.1609435Z compiled=False, 2025-05-07T20:32:03.1609648Z ) 2025-05-07T20:32:03.1609992Z self = 2025-05-07T20:32:03.1610549Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.1610827Z 2025-05-07T20:32:03.1610907Z @given( 2025-05-07T20:32:03.1611149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1611472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1611782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1612125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1612465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1612756Z ) 2025-05-07T20:32:03.1613108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1613561Z def test_silu_mul_quant( 2025-05-07T20:32:03.1613814Z self, 2025-05-07T20:32:03.1614015Z T: int, 2025-05-07T20:32:03.1614221Z D: int, 2025-05-07T20:32:03.1614447Z scale_ub: Optional[float], 2025-05-07T20:32:03.1614723Z contiguous: bool, 2025-05-07T20:32:03.1615058Z compiled: bool, 2025-05-07T20:32:03.1615289Z ) -> None: 2025-05-07T20:32:03.1615508Z torch.manual_seed(2025) 2025-05-07T20:32:03.1615759Z 2025-05-07T20:32:03.1616042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1616389Z 2025-05-07T20:32:03.1616592Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1616893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1617211Z x = x_sign * x_clamp 2025-05-07T20:32:03.1617454Z x0 = x[:, :D] 2025-05-07T20:32:03.1617676Z x1 = x[:, D:] 2025-05-07T20:32:03.1618104Z 2025-05-07T20:32:03.1618296Z if contiguous: 2025-05-07T20:32:03.1618533Z x0 = x0.contiguous() 2025-05-07T20:32:03.1618800Z x1 = x1.contiguous() 2025-05-07T20:32:03.1619049Z 2025-05-07T20:32:03.1619250Z if scale_ub is not None: 2025-05-07T20:32:03.1619530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1619880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1620202Z ) 2025-05-07T20:32:03.1620437Z else: 2025-05-07T20:32:03.1620660Z scale_ub_tensor = None 2025-05-07T20:32:03.1620925Z 2025-05-07T20:32:03.1621165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1621481Z op = silu_mul_quant 2025-05-07T20:32:03.1621741Z if compiled: 2025-05-07T20:32:03.1621997Z op = torch.compile(op) 2025-05-07T20:32:03.1622299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1622583Z 2025-05-07T20:32:03.1622794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1622963Z 2025-05-07T20:32:03.1623071Z moe/activation_test.py:117: 2025-05-07T20:32:03.1623485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1623825Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1624118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1624856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1625564Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1626119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1626823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1627503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1628055Z kernel = self.compile( 2025-05-07T20:32:03.1628614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1629300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1629708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1629950Z 2025-05-07T20:32:03.1630166Z self = 2025-05-07T20:32:03.1631281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1632696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c6560>} 2025-05-07T20:32:03.1634080Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1635136Z context = 2025-05-07T20:32:03.1635436Z 2025-05-07T20:32:03.1635691Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1636229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1636708Z module_map=module_map) 2025-05-07T20:32:03.1637083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1637446Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1637715Z E ^ 2025-05-07T20:32:03.1638191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1638656Z 2025-05-07T20:32:03.1639087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1639614Z 2025-05-07T20:32:03.1639730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1640195Z self=, 2025-05-07T20:32:03.1640621Z T=128, 2025-05-07T20:32:03.1640814Z D=7168, 2025-05-07T20:32:03.1641011Z scale_ub=None, 2025-05-07T20:32:03.1641229Z contiguous=False, 2025-05-07T20:32:03.1641461Z compiled=True, 2025-05-07T20:32:03.1641672Z ) 2025-05-07T20:32:03.2271606Z self = 2025-05-07T20:32:03.2272138Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.2272449Z 2025-05-07T20:32:03.2272535Z @given( 2025-05-07T20:32:03.2272803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.2273121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.2273440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.2273781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.2274294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.2274596Z ) 2025-05-07T20:32:03.2274960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.2275426Z def test_silu_mul_quant( 2025-05-07T20:32:03.2275671Z self, 2025-05-07T20:32:03.2275873Z T: int, 2025-05-07T20:32:03.2276078Z D: int, 2025-05-07T20:32:03.2276299Z scale_ub: Optional[float], 2025-05-07T20:32:03.2276581Z contiguous: bool, 2025-05-07T20:32:03.2276833Z compiled: bool, 2025-05-07T20:32:03.2277057Z ) -> None: 2025-05-07T20:32:03.2277281Z torch.manual_seed(2025) 2025-05-07T20:32:03.2277530Z 2025-05-07T20:32:03.2277807Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.2278159Z 2025-05-07T20:32:03.2278360Z x_sign = torch.sign(x) 2025-05-07T20:32:03.2278651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.2278976Z x = x_sign * x_clamp 2025-05-07T20:32:03.2279221Z x0 = x[:, :D] 2025-05-07T20:32:03.2279442Z x1 = x[:, D:] 2025-05-07T20:32:03.2279653Z 2025-05-07T20:32:03.2279848Z if contiguous: 2025-05-07T20:32:03.2280083Z x0 = x0.contiguous() 2025-05-07T20:32:03.2280377Z x1 = x1.contiguous() 2025-05-07T20:32:03.2280648Z 2025-05-07T20:32:03.2280846Z if scale_ub is not None: 2025-05-07T20:32:03.2281122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.2281467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.2281781Z ) 2025-05-07T20:32:03.2281972Z else: 2025-05-07T20:32:03.2282188Z scale_ub_tensor = None 2025-05-07T20:32:03.2282453Z 2025-05-07T20:32:03.2282688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.2283008Z op = silu_mul_quant 2025-05-07T20:32:03.2283267Z if compiled: 2025-05-07T20:32:03.2283522Z op = torch.compile(op) 2025-05-07T20:32:03.2283829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.2284107Z 2025-05-07T20:32:03.2284429Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.2284725Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.2285025Z 2025-05-07T20:32:03.2285266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.2285618Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.2292780Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.2293171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.2293543Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.2293851Z 2025-05-07T20:32:03.2294060Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:03.2294262Z 2025-05-07T20:32:03.2294371Z moe/activation_test.py:126: 2025-05-07T20:32:03.2294675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.2295015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.2295347Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.2296159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.2296919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.2297475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.2298269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.2298969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.2299709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.2300641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:03.2301407Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.2302148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.2302801Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.2303411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.2303937Z fn() 2025-05-07T20:32:03.2304449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.2305044Z self.fn.run( 2025-05-07T20:32:03.2305523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.2306057Z kernel = self.compile( 2025-05-07T20:32:03.2306621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.2307292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2307701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.2307929Z 2025-05-07T20:32:03.2308144Z self = 2025-05-07T20:32:03.2309251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.2310711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc23c7d00>} 2025-05-07T20:32:03.2312085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.2313216Z context = 2025-05-07T20:32:03.2313516Z 2025-05-07T20:32:03.2313687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.2314225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2314703Z module_map=module_map) 2025-05-07T20:32:03.2315071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.2315439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.2315716Z E ^ 2025-05-07T20:32:03.2316185Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.2316647Z 2025-05-07T20:32:03.2317075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.2317601Z 2025-05-07T20:32:03.2317709Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.2318143Z self=, 2025-05-07T20:32:03.2318542Z T=128, 2025-05-07T20:32:03.2318742Z D=7168, 2025-05-07T20:32:03.2318939Z scale_ub=None, 2025-05-07T20:32:03.2319157Z contiguous=False, 2025-05-07T20:32:03.2319388Z compiled=False, 2025-05-07T20:32:03.2319617Z ) 2025-05-07T20:32:03.5919603Z self = 2025-05-07T20:32:03.5920147Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.5920491Z 2025-05-07T20:32:03.5920582Z @given( 2025-05-07T20:32:03.5920818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.5921163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.5921663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.5922010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.5922349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.5922648Z ) 2025-05-07T20:32:03.5923004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.5923461Z def test_silu_mul_quant( 2025-05-07T20:32:03.5923712Z self, 2025-05-07T20:32:03.5923912Z T: int, 2025-05-07T20:32:03.5924120Z D: int, 2025-05-07T20:32:03.5924347Z scale_ub: Optional[float], 2025-05-07T20:32:03.5924632Z contiguous: bool, 2025-05-07T20:32:03.5924875Z compiled: bool, 2025-05-07T20:32:03.5925112Z ) -> None: 2025-05-07T20:32:03.5925339Z torch.manual_seed(2025) 2025-05-07T20:32:03.5925585Z 2025-05-07T20:32:03.5925871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.5926224Z 2025-05-07T20:32:03.5926429Z x_sign = torch.sign(x) 2025-05-07T20:32:03.5926734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.5927054Z x = x_sign * x_clamp 2025-05-07T20:32:03.5927303Z x0 = x[:, :D] 2025-05-07T20:32:03.5927530Z x1 = x[:, D:] 2025-05-07T20:32:03.5927746Z 2025-05-07T20:32:03.5927940Z if contiguous: 2025-05-07T20:32:03.5928186Z x0 = x0.contiguous() 2025-05-07T20:32:03.5928453Z x1 = x1.contiguous() 2025-05-07T20:32:03.5928698Z 2025-05-07T20:32:03.5928900Z if scale_ub is not None: 2025-05-07T20:32:03.5929181Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.5929528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.5929848Z ) 2025-05-07T20:32:03.5930049Z else: 2025-05-07T20:32:03.5930270Z scale_ub_tensor = None 2025-05-07T20:32:03.5930552Z 2025-05-07T20:32:03.5930815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.5931144Z op = silu_mul_quant 2025-05-07T20:32:03.5931395Z if compiled: 
2025-05-07T20:32:03.5931648Z op = torch.compile(op) 2025-05-07T20:32:03.5932140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5932412Z 2025-05-07T20:32:03.5932609Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.5932775Z 2025-05-07T20:32:03.5932885Z moe/activation_test.py:117: 2025-05-07T20:32:03.5933180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5933513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.5933803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5934511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.5935216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.5935766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.5936471Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.5937142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.5937691Z kernel = self.compile( 2025-05-07T20:32:03.5938378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.5939053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.5939451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5939686Z 2025-05-07T20:32:03.5939901Z self = 2025-05-07T20:32:03.5941095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.5942506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc23c72e0>} 2025-05-07T20:32:03.5943880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.5944927Z context = 2025-05-07T20:32:03.5945255Z 2025-05-07T20:32:03.5945426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.5945962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.5946445Z module_map=module_map) 2025-05-07T20:32:03.5946813Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.5947185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.5947448Z E ^ 2025-05-07T20:32:03.5947924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.5948388Z 2025-05-07T20:32:03.5948812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.5949340Z 2025-05-07T20:32:03.5949448Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.5949870Z self=, 2025-05-07T20:32:03.5950315Z T=4096, 2025-05-07T20:32:03.5950520Z D=5120, 2025-05-07T20:32:03.5950718Z scale_ub=1200.0, 2025-05-07T20:32:03.5950949Z contiguous=True, 2025-05-07T20:32:03.5951174Z compiled=False, 2025-05-07T20:32:03.5951386Z ) 2025-05-07T20:32:03.5951718Z self = 2025-05-07T20:32:03.5952219Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:03.5952499Z 2025-05-07T20:32:03.5952577Z @given( 2025-05-07T20:32:03.5952895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.5953207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.5953518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.5953852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.5954185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.5954471Z ) 2025-05-07T20:32:03.5954825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.5955275Z def test_silu_mul_quant( 2025-05-07T20:32:03.5955514Z self, 2025-05-07T20:32:03.5955982Z T: int, 2025-05-07T20:32:03.5956186Z D: int, 2025-05-07T20:32:03.5956401Z scale_ub: Optional[float], 2025-05-07T20:32:03.5956679Z contiguous: bool, 2025-05-07T20:32:03.5956931Z compiled: bool, 2025-05-07T20:32:03.5957153Z ) -> None: 2025-05-07T20:32:03.5957375Z torch.manual_seed(2025) 2025-05-07T20:32:03.5957623Z 2025-05-07T20:32:03.5957904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.5958252Z 2025-05-07T20:32:03.5958449Z x_sign = torch.sign(x) 2025-05-07T20:32:03.5958743Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.5959057Z x = x_sign * x_clamp 2025-05-07T20:32:03.5959300Z x0 = x[:, :D] 2025-05-07T20:32:03.5959520Z x1 = x[:, D:] 2025-05-07T20:32:03.5959725Z 2025-05-07T20:32:03.5959916Z if contiguous: 2025-05-07T20:32:03.5960152Z x0 = x0.contiguous() 2025-05-07T20:32:03.5960439Z x1 = x1.contiguous() 2025-05-07T20:32:03.5960706Z 2025-05-07T20:32:03.5960901Z if scale_ub is not None: 2025-05-07T20:32:03.5961174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.5961657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.5961972Z ) 2025-05-07T20:32:03.5962164Z else: 2025-05-07T20:32:03.5962379Z scale_ub_tensor = None 2025-05-07T20:32:03.5962641Z 2025-05-07T20:32:03.5962875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.5963192Z op = silu_mul_quant 2025-05-07T20:32:03.5963443Z if compiled: 2025-05-07T20:32:03.5963688Z op = torch.compile(op) 2025-05-07T20:32:03.5963991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5964268Z 2025-05-07T20:32:03.5964459Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.5964631Z 2025-05-07T20:32:03.5964731Z moe/activation_test.py:117: 2025-05-07T20:32:03.5965029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5965358Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.5965641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.5966352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.5967068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.5967615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.5968303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.5968982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.5969523Z kernel = self.compile( 2025-05-07T20:32:03.5970075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.5970741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.5971140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.5971372Z 2025-05-07T20:32:03.5971590Z self = 2025-05-07T20:32:03.5972691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.5974239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9df6cdc0>} 2025-05-07T20:32:03.5975612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.5976659Z context = 2025-05-07T20:32:03.5976950Z 2025-05-07T20:32:03.5977131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.5977659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.5978328Z module_map=module_map) 2025-05-07T20:32:03.5978703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.5979067Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.5979327Z E ^ 2025-05-07T20:32:03.5979797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.5980266Z 2025-05-07T20:32:03.5980734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.5981252Z 2025-05-07T20:32:03.5981363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.5981779Z self=, 2025-05-07T20:32:03.5982183Z T=1, 2025-05-07T20:32:03.5982456Z D=5120, 2025-05-07T20:32:03.5982648Z scale_ub=None, 2025-05-07T20:32:03.5982867Z contiguous=True, 2025-05-07T20:32:03.5983096Z compiled=True, 2025-05-07T20:32:03.5983305Z ) 2025-05-07T20:32:04.0553216Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.0554321Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:04.0555910Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.0557486Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.0558904Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.0560321Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0561652Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.0563056Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0564500Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.0565961Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.0567210Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.0568448Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:04.0569508Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.0570595Z W0507 20:32:04.052000 87377 
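The CompilationError above is an architecture limit rather than a bug in the kernel under test: Triton's fp8e4nv is the e4m3 encoding behind torch.float8_e4m3fn, and this Triton build only lowers it on NVIDIA GPUs with compute capability (8, 9) or newer (Ada/Hopper). On older parts only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability gate that would skip these tests on such GPUs (the helper name and the skip wiring are assumptions, not FBGEMM code):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton can lower fp8e4nv (e4m3) only on sm_89+ GPUs;
        # older architectures are limited to fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on a unittest-style test like the one in this log:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89+")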
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:04.0571846Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.0573152Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.0574289Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.0575346Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:04.0576665Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.0578147Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.0579229Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0580157Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0580957Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:04.0582000Z W0507 20:32:04.052000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.2175274Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.2176783Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:04.2178252Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.2179778Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.2181197Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.2182812Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.2184141Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.2185543Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.2186990Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.2188262Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.2189502Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.2190784Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:04.2191835Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:04.2193011Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:04.2194250Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.2195558Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.2196696Z W0507 
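The repeated W0507 blocks are torch.compile bookkeeping, not additional failures: when Dynamo traces a user-defined Triton kernel, triton_kernel_wrap.py generates TTIR for it to work out which tensor arguments the kernel mutates. When that generation fails (here with the same fp8e4nv ValueError), it falls back to "assuming every input is mutated", which is safe but pessimistic, so each attempt is logged as a warning before the real launch fails. A hedged sketch of that fallback pattern (analyze_ttir is a stand-in for the torch internals, not a real API):

    import torch

    def analyze_ttir(kernel, args):
        # Stand-in for torch's generate_ttir-based mutation analysis;
        # on this GPU it fails just like the log shows.
        raise ValueError("type fp8e4nv not supported in this architecture")

    def mutated_tensor_args(kernel, args):
        try:
            return analyze_ttir(kernel, args)
        except Exception:
            # "Assuming every input is mutated": correct in all cases, but
            # it blocks optimizations that need to tell reads from writes.
            return [a for a in args if isinstance(a, torch.Tensor)]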
20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:04.2197760Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:04.2198969Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.2200344Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.2201472Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.2202400Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.2203153Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:04.2204190Z W0507 20:32:04.214000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6611677Z self = 2025-05-07T20:32:04.6612490Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.6612858Z 2025-05-07T20:32:04.6612944Z @given( 2025-05-07T20:32:04.6613198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6613527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6613845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6614191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6614532Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6614825Z ) 2025-05-07T20:32:04.6615186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6615646Z def test_silu_mul_quant( 2025-05-07T20:32:04.6615904Z self, 2025-05-07T20:32:04.6616111Z T: int, 2025-05-07T20:32:04.6616321Z D: int, 2025-05-07T20:32:04.6616554Z scale_ub: Optional[float], 2025-05-07T20:32:04.6616834Z contiguous: bool, 2025-05-07T20:32:04.6617090Z compiled: bool, 2025-05-07T20:32:04.6617330Z ) -> None: 2025-05-07T20:32:04.6617552Z torch.manual_seed(2025) 2025-05-07T20:32:04.6617805Z 2025-05-07T20:32:04.6618161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6618515Z 2025-05-07T20:32:04.6618723Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6619028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6619344Z x = x_sign * x_clamp 2025-05-07T20:32:04.6619593Z x0 = x[:, :D] 2025-05-07T20:32:04.6619820Z x1 = x[:, D:] 2025-05-07T20:32:04.6620039Z 2025-05-07T20:32:04.6620231Z if contiguous: 2025-05-07T20:32:04.6620474Z x0 = x0.contiguous() 2025-05-07T20:32:04.6620923Z x1 = x1.contiguous() 2025-05-07T20:32:04.6621173Z 2025-05-07T20:32:04.6621375Z if scale_ub is not None: 2025-05-07T20:32:04.6621661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6622014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6622332Z ) 2025-05-07T20:32:04.6622535Z else: 2025-05-07T20:32:04.6622749Z scale_ub_tensor = None 
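The listing resumes below with fn() and ref_fn(): the op under test fuses SiLU-and-multiply with fp8 quantization, while ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and only then quantizes rowwise. The activation pulled out as a standalone sketch (the function name is illustrative, not FBGEMM's API):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, matching the math in the test's ref_fn.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32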
2025-05-07T20:32:04.6623012Z 2025-05-07T20:32:04.6623257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6623578Z op = silu_mul_quant 2025-05-07T20:32:04.6623839Z if compiled: 2025-05-07T20:32:04.6624100Z op = torch.compile(op) 2025-05-07T20:32:04.6624407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6624695Z 2025-05-07T20:32:04.6624896Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.6625195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.6625490Z 2025-05-07T20:32:04.6625750Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6626098Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.6626403Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.6626734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.6627107Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6627428Z 2025-05-07T20:32:04.6627643Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:04.6627847Z 2025-05-07T20:32:04.6627959Z moe/activation_test.py:126: 2025-05-07T20:32:04.6628265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6628609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.6628953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6629779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.6630551Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.6631116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6631901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6632608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.6633340Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6634108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:04.6634869Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6635617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.6636267Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.6636884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.6637420Z fn() 2025-05-07T20:32:04.6637936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.6638528Z self.fn.run( 2025-05-07T20:32:04.6639008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6639550Z kernel = self.compile( 2025-05-07T20:32:04.6640098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6640820Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6641226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6641454Z 2025-05-07T20:32:04.6641749Z self = 2025-05-07T20:32:04.6642861Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6644274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc2018430>} 2025-05-07T20:32:04.6645647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6646698Z context = 2025-05-07T20:32:04.6646993Z 2025-05-07T20:32:04.6647169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6647704Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6648191Z module_map=module_map) 2025-05-07T20:32:04.6648561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6648928Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.6649207Z E ^ 2025-05-07T20:32:04.6649687Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6650145Z 2025-05-07T20:32:04.6650573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6651149Z 2025-05-07T20:32:04.6651258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6651684Z self=, 2025-05-07T20:32:04.6652093Z T=2048, 2025-05-07T20:32:04.6652289Z D=5120, 2025-05-07T20:32:04.6652494Z scale_ub=None, 2025-05-07T20:32:04.6652719Z contiguous=True, 2025-05-07T20:32:04.6652947Z compiled=True, 2025-05-07T20:32:04.6653243Z ) 2025-05-07T20:32:05.0816882Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.0818277Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:05.0819640Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.0821150Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.0822554Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.0823956Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.0825289Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.0826689Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.0828296Z W0507 20:32:05.078000 87377 
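The frames above show the reference path failing the same way: triton_quantize_fp8_row (fp8_gemm.py:2370) launches _kernel_quantize_fp8_row, itself a Triton kernel, so both the fused op and its reference hit the identical fp8e4nv limit. The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None], i.e. scales are per-row multipliers, so a Triton-free rowwise quantizer could look like this sketch (the real kernel's scaling and clamping details may differ; names and eps are assumptions):

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub=None, eps: float = 1e-12):
        # Scale each row so its max |value| maps onto the e4m3 maximum
        # (~448), optionally clamping the row max to scale_ub first.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).clamp_min(eps)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize as y_fp8.float() * scale[:, None]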
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.0829570Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.0830832Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.0832081Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:05.0833139Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.0834175Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:05.0835422Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.0836718Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.0837855Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.0838910Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:05.0840114Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.0841615Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.0842685Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.0843613Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.0844364Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:05.0845402Z W0507 20:32:05.078000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2435641Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2437106Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:05.2438467Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2439914Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2441556Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2442968Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2444288Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2445689Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2447133Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2448405Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2449643Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2450900Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:05.2451973Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:05.2453017Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:05.2454256Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2455871Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2457002Z W0507 
20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:05.2458126Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:05.2459335Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2460716Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2461791Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2462709Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2463456Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:05.2464610Z W0507 20:32:05.240000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6846983Z self = 2025-05-07T20:32:05.6847685Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.6847974Z 2025-05-07T20:32:05.6848060Z @given( 2025-05-07T20:32:05.6848296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.6848617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.6848955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.6849296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.6849629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.6849922Z ) 2025-05-07T20:32:05.6850281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.6850731Z def test_silu_mul_quant( 2025-05-07T20:32:05.6851005Z self, 2025-05-07T20:32:05.6851226Z T: int, 2025-05-07T20:32:05.6851428Z D: int, 2025-05-07T20:32:05.6851644Z scale_ub: Optional[float], 2025-05-07T20:32:05.6851920Z contiguous: bool, 2025-05-07T20:32:05.6852168Z compiled: bool, 2025-05-07T20:32:05.6852392Z ) -> None: 2025-05-07T20:32:05.6852616Z torch.manual_seed(2025) 2025-05-07T20:32:05.6852864Z 2025-05-07T20:32:05.6853139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.6853485Z 2025-05-07T20:32:05.6853692Z x_sign = torch.sign(x) 2025-05-07T20:32:05.6853985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.6854298Z x = x_sign * x_clamp 2025-05-07T20:32:05.6854540Z x0 = x[:, :D] 2025-05-07T20:32:05.6854754Z x1 = x[:, D:] 2025-05-07T20:32:05.6854963Z 2025-05-07T20:32:05.6855153Z if contiguous: 2025-05-07T20:32:05.6855388Z x0 = x0.contiguous() 2025-05-07T20:32:05.6855833Z x1 = x1.contiguous() 2025-05-07T20:32:05.6856079Z 2025-05-07T20:32:05.6856278Z if scale_ub is not None: 2025-05-07T20:32:05.6856551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.6857065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.6857375Z ) 2025-05-07T20:32:05.6857565Z else: 2025-05-07T20:32:05.6857778Z scale_ub_tensor = None 
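The full test source reprints for every failing draw because the test runs under Hypothesis with verbosity=Verbosity.verbose: each sampled combination of (T, D, scale_ub, contiguous, compiled) is announced as "Trying example", and deadline=None disables the per-example time budget. A self-contained sketch of that setup with a toy body (max_examples uses a literal here because _MAX_SAMPLES's value does not appear in this log):

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(t=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_demo(t: int) -> None:
        # Verbose mode prints "Trying example: test_demo(t=...)" per draw.
        assert t >= 1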
2025-05-07T20:32:05.6858085Z 2025-05-07T20:32:05.6858316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6858641Z op = silu_mul_quant 2025-05-07T20:32:05.6858890Z if compiled: 2025-05-07T20:32:05.6859134Z op = torch.compile(op) 2025-05-07T20:32:05.6859439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.6859714Z 2025-05-07T20:32:05.6859907Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.6860200Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.6860492Z 2025-05-07T20:32:05.6860738Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6861122Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.6861430Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.6861750Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.6862105Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6862422Z 2025-05-07T20:32:05.6862626Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.6862823Z 2025-05-07T20:32:05.6862929Z moe/activation_test.py:126: 2025-05-07T20:32:05.6863226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6863560Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.6863891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6864874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.6865645Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.6866202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.6866894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.6867594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.6868329Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6869093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.6869849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6870588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.6871250Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.6871861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.6872391Z fn() 2025-05-07T20:32:05.6872914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.6873508Z self.fn.run( 2025-05-07T20:32:05.6873980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.6874522Z kernel = self.compile( 2025-05-07T20:32:05.6875077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.6875748Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.6876148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6876381Z 2025-05-07T20:32:05.6876599Z self = 2025-05-07T20:32:05.6877708Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.6879223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2d40>} 2025-05-07T20:32:05.6880592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.6881690Z context = 2025-05-07T20:32:05.6881987Z 2025-05-07T20:32:05.6882161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.6882690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.6883168Z module_map=module_map) 2025-05-07T20:32:05.6883539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.6883900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.6884168Z E ^ 2025-05-07T20:32:05.6884643Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6885105Z 2025-05-07T20:32:05.6885530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.6886050Z 2025-05-07T20:32:05.6886161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.6886578Z self=, 2025-05-07T20:32:05.6886984Z T=128, 2025-05-07T20:32:05.6887257Z D=5120, 2025-05-07T20:32:05.6887452Z scale_ub=None, 2025-05-07T20:32:05.6887673Z contiguous=True, 2025-05-07T20:32:05.6887901Z compiled=True, 2025-05-07T20:32:05.6888117Z ) 2025-05-07T20:32:06.1569375Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.1570468Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:06.1571830Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.1573292Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.1574693Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.1576108Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.1577447Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.1578941Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.1580398Z W0507 20:32:06.153000 87377 
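For the compiled=True examples, fn() wraps the op in torch.compile before calling it, which is why Dynamo's triton_kernel_wrap warnings precede each failure: the kernel is analyzed during tracing, then compiled again (and fails again) at launch. A minimal sketch of that dispatch (run_op is illustrative, not the test's helper):

    import torch

    def run_op(op, *args, compiled: bool = False):
        # Mirrors the test's fn(): optionally route through torch.compile
        # so Dynamo traces the op; user-defined Triton kernels inside it
        # are inspected via triton_kernel_wrap during tracing.
        if compiled:
            op = torch.compile(op)
        return op(*args)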
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.1581832Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.1583072Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.1584308Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:06.1585369Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:06.1586413Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:06.1587657Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.1588964Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.1590106Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:06.1591167Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:06.1592480Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.1593861Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.1594937Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.1595865Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.1596620Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:06.1597669Z W0507 20:32:06.153000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.3209194Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.3210875Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:06.3212236Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.3213692Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.3215113Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.3216681Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.3218076Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.3219480Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.3220936Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.3222251Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.3223497Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.3224729Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:06.3225785Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:06.3226936Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:06.3228186Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.3229498Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.3230640Z W0507 
20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:06.3231759Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:06.3232964Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.3234352Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.3235433Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.3236363Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.3237110Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:06.3238154Z W0507 20:32:06.317000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0931706Z self = 2025-05-07T20:32:07.0932475Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.0932872Z 2025-05-07T20:32:07.0932958Z @given( 2025-05-07T20:32:07.0933202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.0933532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.0933853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.0934190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.0934530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.0934822Z ) 2025-05-07T20:32:07.0935186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.0935638Z def test_silu_mul_quant( 2025-05-07T20:32:07.0935887Z self, 2025-05-07T20:32:07.0936086Z T: int, 2025-05-07T20:32:07.0936288Z D: int, 2025-05-07T20:32:07.0936518Z scale_ub: Optional[float], 2025-05-07T20:32:07.0936792Z contiguous: bool, 2025-05-07T20:32:07.0937043Z compiled: bool, 2025-05-07T20:32:07.0937273Z ) -> None: 2025-05-07T20:32:07.0937491Z torch.manual_seed(2025) 2025-05-07T20:32:07.0937739Z 2025-05-07T20:32:07.0938151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.0938499Z 2025-05-07T20:32:07.0938696Z x_sign = torch.sign(x) 2025-05-07T20:32:07.0938998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.0939317Z x = x_sign * x_clamp 2025-05-07T20:32:07.0939565Z x0 = x[:, :D] 2025-05-07T20:32:07.0939783Z x1 = x[:, D:] 2025-05-07T20:32:07.0939996Z 2025-05-07T20:32:07.0940193Z if contiguous: 2025-05-07T20:32:07.0940429Z x0 = x0.contiguous() 2025-05-07T20:32:07.0940720Z x1 = x1.contiguous() 2025-05-07T20:32:07.0941154Z 2025-05-07T20:32:07.0941357Z if scale_ub is not None: 2025-05-07T20:32:07.0941639Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.0941991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.0942297Z ) 2025-05-07T20:32:07.0942499Z else: 2025-05-07T20:32:07.0942719Z scale_ub_tensor = None 
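The listing continues below toward the comparison step: both the fused and reference outputs are dequantized with their row scales and compared in fp32. This section of the log never reaches that assertion, but it would plausibly look like the following sketch (the use of assert_close and the tolerances are assumptions, not the test's actual check):

    import torch

    def dequant_and_compare(y_fp8, y_scale, y_fp8_ref, y_scale_ref):
        # Both paths dequantize as fp8 * row_scale, then compare in fp32.
        y = y_fp8.to(torch.float32) * y_scale[:, None]
        y_ref = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
        torch.testing.assert_close(y, y_ref, atol=1e-1, rtol=5e-2)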
2025-05-07T20:32:07.0942980Z 2025-05-07T20:32:07.0943219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0943535Z op = silu_mul_quant 2025-05-07T20:32:07.0943787Z if compiled: 2025-05-07T20:32:07.0944043Z op = torch.compile(op) 2025-05-07T20:32:07.0944350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.0944630Z 2025-05-07T20:32:07.0944824Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.0945118Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.0945417Z 2025-05-07T20:32:07.0945667Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.0946010Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.0946316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.0946632Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.0946997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0947312Z 2025-05-07T20:32:07.0947516Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.0947720Z 2025-05-07T20:32:07.0947826Z moe/activation_test.py:126: 2025-05-07T20:32:07.0948133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0948471Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.0948802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.0949606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.0950375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.0950935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.0951775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.0952486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.0953218Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0953977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.0954743Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.0955489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.0956330Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.0956939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.0957470Z fn() 2025-05-07T20:32:07.0957991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.0958579Z self.fn.run( 2025-05-07T20:32:07.0959062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.0959603Z kernel = self.compile( 2025-05-07T20:32:07.0960159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.0960828Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.0961231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.0961458Z 2025-05-07T20:32:07.0961804Z self = 2025-05-07T20:32:07.0962916Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.0964317Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2e60>} 2025-05-07T20:32:07.0965690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.0966738Z context = 2025-05-07T20:32:07.0967033Z 2025-05-07T20:32:07.0967214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.0967741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.0968231Z module_map=module_map) 2025-05-07T20:32:07.0968609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.0968976Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.0969251Z E ^ 2025-05-07T20:32:07.0969728Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.0970185Z 2025-05-07T20:32:07.0970616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.0971160Z 2025-05-07T20:32:07.0971282Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.0971716Z self=, 2025-05-07T20:32:07.0972129Z T=4096, 2025-05-07T20:32:07.0972329Z D=5120, 2025-05-07T20:32:07.0972523Z scale_ub=None, 2025-05-07T20:32:07.0972747Z contiguous=True, 2025-05-07T20:32:07.0972982Z compiled=True, 2025-05-07T20:32:07.0973312Z ) 2025-05-07T20:32:07.5693406Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.5695668Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:07.5698312Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.5700995Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.5702631Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.5704048Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.5705389Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.5706797Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.5708414Z W0507 20:32:07.565000 87377 
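The reference-path traceback above also passes through Triton's autotuner: _kernel_quantize_fp8_row is autotuned, so every pruned config is compiled and benchmarked inside do_bench (quantiles 0.5/0.2/0.8), and the architecture error therefore surfaces from _bench rather than at the call site. A toy sketch of that decoration (kernel body and configs are illustrative, not FBGEMM's):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 128}), triton.Config({"BLOCK": 256})],
        key=["n"],
    )
    @triton.jit
    def _double_rows(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
        # Each config is compiled and timed on first launch; a dtype the
        # GPU cannot lower fails here, inside the autotuner's benchmark.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x * 2.0, mask=mask)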
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.5709699Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:07.5710942Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.5712179Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:07.5713239Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:07.5714281Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:07.5715524Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.5716829Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.5717968Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:07.5719030Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:07.5720235Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.5721779Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.5722855Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.5723781Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.5724532Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:07.5725576Z W0507 20:32:07.565000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.7338806Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.7339903Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:07.7341296Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.7342772Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.7344350Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.7345767Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.7347099Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.7348501Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.7349944Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.7351212Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:07.7352501Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.7353721Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:07.7354772Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 418, in visit 2025-05-07T20:32:07.7356051Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:07.7357293Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.7358722Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.7359854Z W0507 
20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/ast.py", line 426, in generic_visit 2025-05-07T20:32:07.7360916Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:07.7362114Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.7363495Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.7364564Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.7365491Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.7366243Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:07.7367419Z W0507 20:32:07.730000 87377 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3217570Z self = 2025-05-07T20:32:08.3218255Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.3218546Z 2025-05-07T20:32:08.3226220Z @given( 2025-05-07T20:32:08.3226533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.3226874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.3227209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.3227559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.3227893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.3228192Z ) 2025-05-07T20:32:08.3228563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.3229020Z def test_silu_mul_quant( 2025-05-07T20:32:08.3229277Z self, 2025-05-07T20:32:08.3229501Z T: int, 2025-05-07T20:32:08.3229698Z D: int, 2025-05-07T20:32:08.3229932Z scale_ub: Optional[float], 2025-05-07T20:32:08.3230220Z contiguous: bool, 2025-05-07T20:32:08.3230473Z compiled: bool, 2025-05-07T20:32:08.3230714Z ) -> None: 2025-05-07T20:32:08.3230950Z torch.manual_seed(2025) 2025-05-07T20:32:08.3231225Z 2025-05-07T20:32:08.3231540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.3231905Z 2025-05-07T20:32:08.3232112Z x_sign = torch.sign(x) 2025-05-07T20:32:08.3232424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.3232753Z x = x_sign * x_clamp 2025-05-07T20:32:08.3233006Z x0 = x[:, :D] 2025-05-07T20:32:08.3233226Z x1 = x[:, D:] 2025-05-07T20:32:08.3233453Z 2025-05-07T20:32:08.3233655Z if contiguous: 2025-05-07T20:32:08.3233893Z x0 = x0.contiguous() 2025-05-07T20:32:08.3234174Z x1 = x1.contiguous() 2025-05-07T20:32:08.3234431Z 2025-05-07T20:32:08.3234630Z if scale_ub is not None: 2025-05-07T20:32:08.3234917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.3235654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.3235969Z ) 2025-05-07T20:32:08.3236202Z else: 2025-05-07T20:32:08.3236429Z scale_ub_tensor = None 
2025-05-07T20:32:08.3236696Z 2025-05-07T20:32:08.3236933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3237264Z op = silu_mul_quant 2025-05-07T20:32:08.3237526Z if compiled: 2025-05-07T20:32:08.3237776Z op = torch.compile(op) 2025-05-07T20:32:08.3238090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.3238381Z 2025-05-07T20:32:08.3238582Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.3238884Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.3239185Z 2025-05-07T20:32:08.3239444Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.3239784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.3240099Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.3240426Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.3240793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3241165Z 2025-05-07T20:32:08.3241382Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.3241585Z 2025-05-07T20:32:08.3241692Z moe/activation_test.py:126: 2025-05-07T20:32:08.3242006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3242349Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.3242695Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.3243677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.3244473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.3245036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.3245751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.3246470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.3247217Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3247985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.3248755Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.3249505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.3250174Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.3250795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.3251385Z fn() 2025-05-07T20:32:08.3251911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.3252502Z self.fn.run( 2025-05-07T20:32:08.3252992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.3253540Z kernel = self.compile( 2025-05-07T20:32:08.3254100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.3254768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3255178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.3255410Z 2025-05-07T20:32:08.3256164Z self = 2025-05-07T20:32:08.3257279Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.3258918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d456320>} 2025-05-07T20:32:08.3260311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.3261424Z context = 2025-05-07T20:32:08.3261722Z 2025-05-07T20:32:08.3261907Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.3262438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3262929Z module_map=module_map) 2025-05-07T20:32:08.3263311Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3263678Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.3263955Z E ^ 2025-05-07T20:32:08.3264443Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.3264905Z 2025-05-07T20:32:08.3265346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.3265872Z 2025-05-07T20:32:08.3265982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.3266414Z self=, 2025-05-07T20:32:08.3266834Z T=16384, 2025-05-07T20:32:08.3267157Z D=5120, 2025-05-07T20:32:08.3267359Z scale_ub=None, 2025-05-07T20:32:08.3267585Z contiguous=True, 2025-05-07T20:32:08.3267816Z compiled=True, 2025-05-07T20:32:08.3268032Z ) 2025-05-07T20:32:08.3651717Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:08.3652999Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:08.3654370Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:08.3655379Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:08.3656755Z W0507 20:32:08.363000 87377 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 
2025-05-07T20:32:08.4679172Z self = 2025-05-07T20:32:08.4679750Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.4680038Z 2025-05-07T20:32:08.4680129Z @given( 2025-05-07T20:32:08.4680369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.4680699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.4681022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.4681362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.4681711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.4682010Z ) 2025-05-07T20:32:08.4682379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.4682855Z def test_silu_mul_quant( 2025-05-07T20:32:08.4683112Z self, 2025-05-07T20:32:08.4683324Z T: int, 2025-05-07T20:32:08.4683861Z D: int, 2025-05-07T20:32:08.4684097Z scale_ub: Optional[float], 2025-05-07T20:32:08.4684386Z contiguous: bool, 2025-05-07T20:32:08.4684628Z compiled: bool, 2025-05-07T20:32:08.4684864Z ) -> None: 2025-05-07T20:32:08.4685095Z torch.manual_seed(2025) 2025-05-07T20:32:08.4685342Z 2025-05-07T20:32:08.4685638Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.4686001Z 2025-05-07T20:32:08.4686199Z x_sign = torch.sign(x) 2025-05-07T20:32:08.4686507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.4686835Z x = x_sign * x_clamp 2025-05-07T20:32:08.4687091Z x0 = x[:, :D] 2025-05-07T20:32:08.4687314Z x1 = x[:, D:] 2025-05-07T20:32:08.4687535Z 2025-05-07T20:32:08.4687742Z if contiguous: 2025-05-07T20:32:08.4687986Z x0 = x0.contiguous() 2025-05-07T20:32:08.4688265Z x1 = x1.contiguous() 2025-05-07T20:32:08.4688519Z 2025-05-07T20:32:08.4688722Z if scale_ub is not None: 2025-05-07T20:32:08.4689005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.4689355Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.4689672Z ) 2025-05-07T20:32:08.4689877Z else: 2025-05-07T20:32:08.4690101Z scale_ub_tensor = None 2025-05-07T20:32:08.4690357Z 2025-05-07T20:32:08.4690604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.4690935Z op = silu_mul_quant 2025-05-07T20:32:08.4691232Z if compiled: 2025-05-07T20:32:08.4691500Z op = torch.compile(op) 2025-05-07T20:32:08.4691811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.4692093Z 2025-05-07T20:32:08.4692300Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.4692751Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.4693059Z 2025-05-07T20:32:08.4693306Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.4693664Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.4693967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.4694290Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.4694665Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.4694994Z 2025-05-07T20:32:08.4695203Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:08.4695414Z 2025-05-07T20:32:08.4695523Z moe/activation_test.py:126: 2025-05-07T20:32:08.4695835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.4696184Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.4696521Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.4697346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.4698246Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.4698816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.4699526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.4700244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.4700998Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.4701772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.4702555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.4703322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.4703992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.4704698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.4705243Z fn() 2025-05-07T20:32:08.4705781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.4706378Z self.fn.run( 2025-05-07T20:32:08.4706871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.4707425Z kernel = self.compile( 2025-05-07T20:32:08.4707990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.4708660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.4709077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.4709313Z 2025-05-07T20:32:08.4709539Z self = 2025-05-07T20:32:08.4710669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.4712106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c25f0>} 2025-05-07T20:32:08.4713504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.4714691Z context = 2025-05-07T20:32:08.4714989Z 2025-05-07T20:32:08.4715178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.4715722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.4716228Z module_map=module_map) 2025-05-07T20:32:08.4716616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.4716994Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.4717271Z E ^ 2025-05-07T20:32:08.4717757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.4718228Z 2025-05-07T20:32:08.4718658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.4719188Z 2025-05-07T20:32:08.4719303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.4719739Z self=, 2025-05-07T20:32:08.4720158Z T=1, 2025-05-07T20:32:08.4720355Z D=5120, 2025-05-07T20:32:08.4720556Z scale_ub=1200.0, 2025-05-07T20:32:08.4720799Z contiguous=True, 2025-05-07T20:32:08.4721060Z compiled=True, 2025-05-07T20:32:08.4721296Z ) 2025-05-07T20:32:08.6165965Z self = 2025-05-07T20:32:08.6166543Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:08.6166818Z 2025-05-07T20:32:08.6166906Z @given( 2025-05-07T20:32:08.6167149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.6167472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.6167788Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.6168123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.6168464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.6168760Z ) 2025-05-07T20:32:08.6169130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.6169584Z def test_silu_mul_quant( 2025-05-07T20:32:08.6170068Z self, 2025-05-07T20:32:08.6170270Z T: int, 2025-05-07T20:32:08.6170470Z D: int, 2025-05-07T20:32:08.6170697Z scale_ub: Optional[float], 2025-05-07T20:32:08.6170983Z contiguous: bool, 2025-05-07T20:32:08.6171227Z compiled: bool, 2025-05-07T20:32:08.6171463Z ) -> None: 2025-05-07T20:32:08.6171721Z torch.manual_seed(2025) 2025-05-07T20:32:08.6171988Z 2025-05-07T20:32:08.6172398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.6172846Z 2025-05-07T20:32:08.6173043Z x_sign = torch.sign(x) 2025-05-07T20:32:08.6173346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.6173664Z x = x_sign * x_clamp 2025-05-07T20:32:08.6173903Z x0 = x[:, :D] 2025-05-07T20:32:08.6174131Z x1 = x[:, D:] 2025-05-07T20:32:08.6174346Z 2025-05-07T20:32:08.6174530Z if contiguous: 2025-05-07T20:32:08.6174768Z x0 = x0.contiguous() 2025-05-07T20:32:08.6175038Z x1 = x1.contiguous() 2025-05-07T20:32:08.6175281Z 2025-05-07T20:32:08.6175480Z if scale_ub is not None: 2025-05-07T20:32:08.6175764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.6176114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.6176429Z ) 2025-05-07T20:32:08.6176625Z else: 2025-05-07T20:32:08.6176839Z scale_ub_tensor = None 2025-05-07T20:32:08.6177092Z 2025-05-07T20:32:08.6177332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6177660Z op = silu_mul_quant 2025-05-07T20:32:08.6177911Z if compiled: 2025-05-07T20:32:08.6178274Z op = torch.compile(op) 2025-05-07T20:32:08.6178586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6179070Z 2025-05-07T20:32:08.6179273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.6179444Z 2025-05-07T20:32:08.6179551Z moe/activation_test.py:117: 2025-05-07T20:32:08.6179855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6180199Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.6180494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6181073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.6181650Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.6182326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.6183036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.6183582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.6184289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.6184970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.6185519Z kernel = self.compile( 2025-05-07T20:32:08.6186069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.6186743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.6187144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6187374Z 2025-05-07T20:32:08.6187591Z self = 2025-05-07T20:32:08.6188708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.6190126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff05e0>} 2025-05-07T20:32:08.6191645Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.6192699Z context = 2025-05-07T20:32:08.6192995Z 2025-05-07T20:32:08.6193170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.6193700Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.6194184Z module_map=module_map) 2025-05-07T20:32:08.6194562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.6194921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.6195184Z E ^ 2025-05-07T20:32:08.6195664Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.6196132Z 2025-05-07T20:32:08.6196564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.6197089Z 2025-05-07T20:32:08.6197195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.6197618Z self=, 2025-05-07T20:32:08.6198026Z T=1, 2025-05-07T20:32:08.6198209Z D=5120, 2025-05-07T20:32:08.6198405Z scale_ub=None, 2025-05-07T20:32:08.6198630Z contiguous=False, 2025-05-07T20:32:08.6198860Z compiled=True, 2025-05-07T20:32:08.6199065Z ) 2025-05-07T20:32:08.6877380Z self = 2025-05-07T20:32:08.6878106Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.6878385Z 2025-05-07T20:32:08.6878470Z @given( 2025-05-07T20:32:08.6878706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.6879038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.6879356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.6879689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.6880031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.6880321Z ) 2025-05-07T20:32:08.6880681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.6881138Z def test_silu_mul_quant( 2025-05-07T20:32:08.6881389Z self, 2025-05-07T20:32:08.6881614Z T: int, 2025-05-07T20:32:08.6881835Z D: int, 2025-05-07T20:32:08.6882054Z scale_ub: Optional[float], 2025-05-07T20:32:08.6882332Z contiguous: bool, 2025-05-07T20:32:08.6882568Z compiled: bool, 2025-05-07T20:32:08.6882805Z ) -> None: 2025-05-07T20:32:08.6883028Z torch.manual_seed(2025) 2025-05-07T20:32:08.6883267Z 2025-05-07T20:32:08.6883546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.6883898Z 2025-05-07T20:32:08.6884092Z x_sign = torch.sign(x) 2025-05-07T20:32:08.6884386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.6884699Z x = x_sign * x_clamp 2025-05-07T20:32:08.6884939Z x0 = x[:, :D] 2025-05-07T20:32:08.6885158Z x1 = x[:, D:] 2025-05-07T20:32:08.6885370Z 2025-05-07T20:32:08.6885563Z if contiguous: 2025-05-07T20:32:08.6885798Z x0 = x0.contiguous() 2025-05-07T20:32:08.6886064Z x1 = x1.contiguous() 2025-05-07T20:32:08.6886313Z 2025-05-07T20:32:08.6886508Z if scale_ub is not None: 2025-05-07T20:32:08.6886784Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.6887131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.6887448Z ) 2025-05-07T20:32:08.6887641Z else: 2025-05-07T20:32:08.6887856Z scale_ub_tensor = None 2025-05-07T20:32:08.6888233Z 2025-05-07T20:32:08.6888468Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6888785Z op = silu_mul_quant 2025-05-07T20:32:08.6889036Z if compiled: 2025-05-07T20:32:08.6889287Z op = torch.compile(op) 2025-05-07T20:32:08.6889594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.6889865Z 2025-05-07T20:32:08.6890061Z y_fp8, y_scale = fn() 2025-05-07T20:32:08.6890354Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:08.6890642Z 2025-05-07T20:32:08.6890884Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.6891223Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:08.6891525Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:08.6891896Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:08.6892264Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.6892583Z 2025-05-07T20:32:08.6892784Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:08.6892988Z 2025-05-07T20:32:08.6893091Z moe/activation_test.py:126: 2025-05-07T20:32:08.6893393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6893731Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:08.6894069Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:08.6894875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:08.6895648Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:08.6896204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.6896980Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.6897688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:08.6898582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.6899369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:08.6900133Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:08.6900876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:08.6901525Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:08.6902135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:08.6902661Z fn() 2025-05-07T20:32:08.6903183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:08.6903774Z self.fn.run( 2025-05-07T20:32:08.6904252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.6904796Z kernel = self.compile( 2025-05-07T20:32:08.6905349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.6906011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.6906410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.6906639Z 2025-05-07T20:32:08.6906858Z self = 2025-05-07T20:32:08.6907966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.6909371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fac9cff1090>} 2025-05-07T20:32:08.6910837Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.6911942Z context = 2025-05-07T20:32:08.6912235Z 2025-05-07T20:32:08.6912408Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.6912936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.6913412Z module_map=module_map) 2025-05-07T20:32:08.6913791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.6914169Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:08.6914438Z E ^ 2025-05-07T20:32:08.6914922Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.6915395Z 2025-05-07T20:32:08.6923099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.6923664Z 2025-05-07T20:32:08.6923778Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.6924208Z self=, 2025-05-07T20:32:08.6924615Z T=1, 2025-05-07T20:32:08.6924808Z D=5120, 2025-05-07T20:32:08.6924999Z scale_ub=None, 2025-05-07T20:32:08.6925217Z contiguous=True, 2025-05-07T20:32:08.6925445Z compiled=False, 2025-05-07T20:32:08.6925648Z ) 2025-05-07T20:32:09.0306529Z self = 2025-05-07T20:32:09.0307640Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0308194Z 2025-05-07T20:32:09.0308368Z @given( 2025-05-07T20:32:09.0308847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0309500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0310137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0310819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0311486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0311884Z ) 2025-05-07T20:32:09.0312257Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0312714Z def test_silu_mul_quant( 2025-05-07T20:32:09.0312974Z self, 2025-05-07T20:32:09.0313181Z T: int, 2025-05-07T20:32:09.0313384Z D: int, 2025-05-07T20:32:09.0313620Z scale_ub: Optional[float], 2025-05-07T20:32:09.0313915Z contiguous: bool, 2025-05-07T20:32:09.0314164Z compiled: bool, 2025-05-07T20:32:09.0314406Z ) -> None: 2025-05-07T20:32:09.0314645Z torch.manual_seed(2025) 2025-05-07T20:32:09.0314894Z 2025-05-07T20:32:09.0315183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0315544Z 2025-05-07T20:32:09.0315750Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0316054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0316382Z x = x_sign * x_clamp 2025-05-07T20:32:09.0316632Z x0 = x[:, :D] 2025-05-07T20:32:09.0316853Z x1 = x[:, D:] 2025-05-07T20:32:09.0317074Z 2025-05-07T20:32:09.0317274Z if contiguous: 2025-05-07T20:32:09.0317513Z x0 = x0.contiguous() 2025-05-07T20:32:09.0317787Z x1 = x1.contiguous() 2025-05-07T20:32:09.0318041Z 2025-05-07T20:32:09.0318240Z if scale_ub is not None: 2025-05-07T20:32:09.0318536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0318891Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0319211Z ) 2025-05-07T20:32:09.0319547Z else: 2025-05-07T20:32:09.0319771Z scale_ub_tensor = None 2025-05-07T20:32:09.0320039Z 2025-05-07T20:32:09.0320279Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0320607Z op = silu_mul_quant 2025-05-07T20:32:09.0320866Z if compiled: 2025-05-07T20:32:09.0321134Z 
op = torch.compile(op) 2025-05-07T20:32:09.0321485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0321769Z 2025-05-07T20:32:09.0321973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0322143Z 2025-05-07T20:32:09.0322248Z moe/activation_test.py:117: 2025-05-07T20:32:09.0322555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0322895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0323189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0323914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0324640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0325207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0325910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0326600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0327162Z kernel = self.compile( 2025-05-07T20:32:09.0327726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0328407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0328902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0329141Z 2025-05-07T20:32:09.0329364Z self = 2025-05-07T20:32:09.0330479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0331953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3880>} 2025-05-07T20:32:09.0333335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0334387Z context = 2025-05-07T20:32:09.0334689Z 2025-05-07T20:32:09.0334867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0335396Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0335887Z module_map=module_map) 2025-05-07T20:32:09.0336266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0336631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0336915Z E ^ 2025-05-07T20:32:09.0337556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0338091Z 2025-05-07T20:32:09.0338529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0339053Z 2025-05-07T20:32:09.0339160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0339595Z self=, 2025-05-07T20:32:09.0340007Z T=128, 2025-05-07T20:32:09.0340201Z D=5120, 2025-05-07T20:32:09.0340404Z scale_ub=None, 2025-05-07T20:32:09.0340729Z contiguous=False, 2025-05-07T20:32:09.0340956Z compiled=True, 2025-05-07T20:32:09.0341170Z ) 2025-05-07T20:32:09.0341501Z self = 2025-05-07T20:32:09.0342054Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:09.0342333Z 2025-05-07T20:32:09.0342413Z @given( 2025-05-07T20:32:09.0342656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0342988Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0343299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0343639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0343982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0344271Z ) 2025-05-07T20:32:09.0344635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0345092Z def test_silu_mul_quant( 2025-05-07T20:32:09.0345340Z self, 2025-05-07T20:32:09.0345553Z T: int, 2025-05-07T20:32:09.0345764Z D: int, 2025-05-07T20:32:09.0345990Z scale_ub: Optional[float], 2025-05-07T20:32:09.0346267Z contiguous: bool, 2025-05-07T20:32:09.0346515Z compiled: bool, 2025-05-07T20:32:09.0346748Z ) -> None: 2025-05-07T20:32:09.0346968Z torch.manual_seed(2025) 2025-05-07T20:32:09.0347216Z 2025-05-07T20:32:09.0347501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0347845Z 2025-05-07T20:32:09.0348050Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0348355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0348672Z x = x_sign * x_clamp 2025-05-07T20:32:09.0348932Z x0 = x[:, :D] 2025-05-07T20:32:09.0349163Z x1 = x[:, D:] 2025-05-07T20:32:09.0349460Z 2025-05-07T20:32:09.0349662Z if contiguous: 2025-05-07T20:32:09.0349904Z x0 = x0.contiguous() 2025-05-07T20:32:09.0350173Z x1 = x1.contiguous() 2025-05-07T20:32:09.0350424Z 2025-05-07T20:32:09.0350632Z if scale_ub is not None: 2025-05-07T20:32:09.0350912Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0351300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0351631Z ) 2025-05-07T20:32:09.0351835Z else: 2025-05-07T20:32:09.0352057Z scale_ub_tensor = None 2025-05-07T20:32:09.0352319Z 2025-05-07T20:32:09.0352568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0352884Z op = silu_mul_quant 2025-05-07T20:32:09.0353144Z if compiled: 2025-05-07T20:32:09.0353399Z op = torch.compile(op) 2025-05-07T20:32:09.0353699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0353981Z 2025-05-07T20:32:09.0354191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0354359Z 2025-05-07T20:32:09.0354462Z moe/activation_test.py:117: 2025-05-07T20:32:09.0354770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0355104Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0355396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0356530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.0357357Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.0358335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0359347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0360147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0360918Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0361625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0362344Z kernel = self.compile( 2025-05-07T20:32:09.0362904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0363576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0363975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0364210Z 2025-05-07T20:32:09.0364425Z self = 2025-05-07T20:32:09.0365528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0366942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3eb0>} 2025-05-07T20:32:09.0368324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0369368Z context = 2025-05-07T20:32:09.0369670Z 2025-05-07T20:32:09.0369843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0370379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0370864Z module_map=module_map) 2025-05-07T20:32:09.0371242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0371606Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0372031Z E ^ 2025-05-07T20:32:09.0372520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0372992Z 2025-05-07T20:32:09.0373417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0373946Z 2025-05-07T20:32:09.0374054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0374483Z self=, 2025-05-07T20:32:09.0374889Z T=128, 2025-05-07T20:32:09.0375089Z D=7168, 2025-05-07T20:32:09.0375296Z scale_ub=1200.0, 2025-05-07T20:32:09.0375523Z contiguous=False, 2025-05-07T20:32:09.0375757Z compiled=False, 2025-05-07T20:32:09.0375969Z ) 2025-05-07T20:32:09.1634826Z self = 2025-05-07T20:32:09.1635458Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.1635738Z 2025-05-07T20:32:09.1635818Z @given( 2025-05-07T20:32:09.1636060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1636386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1636714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1637049Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1637391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1637692Z ) 2025-05-07T20:32:09.1638053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1638499Z def test_silu_mul_quant( 2025-05-07T20:32:09.1638748Z self, 2025-05-07T20:32:09.1638953Z T: int, 2025-05-07T20:32:09.1639150Z D: int, 2025-05-07T20:32:09.1639376Z scale_ub: Optional[float], 2025-05-07T20:32:09.1639662Z contiguous: bool, 2025-05-07T20:32:09.1639905Z compiled: bool, 2025-05-07T20:32:09.1640147Z ) -> None: 2025-05-07T20:32:09.1640373Z torch.manual_seed(2025) 2025-05-07T20:32:09.1640616Z 2025-05-07T20:32:09.1640902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1641456Z 2025-05-07T20:32:09.1641650Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1641971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1642309Z x = x_sign * x_clamp 2025-05-07T20:32:09.1642545Z x0 = x[:, :D] 2025-05-07T20:32:09.1642764Z x1 = x[:, D:] 2025-05-07T20:32:09.1642975Z 2025-05-07T20:32:09.1643164Z if contiguous: 2025-05-07T20:32:09.1643394Z x0 = x0.contiguous() 2025-05-07T20:32:09.1643654Z x1 = x1.contiguous() 2025-05-07T20:32:09.1643895Z 2025-05-07T20:32:09.1644088Z if scale_ub is not None: 2025-05-07T20:32:09.1644367Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1644709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1645018Z ) 2025-05-07T20:32:09.1645217Z else: 2025-05-07T20:32:09.1645433Z scale_ub_tensor = None 2025-05-07T20:32:09.1645691Z 2025-05-07T20:32:09.1645930Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1646249Z op = silu_mul_quant 2025-05-07T20:32:09.1646497Z if compiled: 2025-05-07T20:32:09.1646750Z op = torch.compile(op) 2025-05-07T20:32:09.1647053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1647325Z 2025-05-07T20:32:09.1647521Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.1647691Z 2025-05-07T20:32:09.1647794Z moe/activation_test.py:117: 2025-05-07T20:32:09.1648090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1648413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.1648699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1649518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.1650218Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.1650772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1651518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1652189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1652723Z kernel = self.compile( 2025-05-07T20:32:09.1653275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1654029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1654478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1654712Z 2025-05-07T20:32:09.1654931Z self = 2025-05-07T20:32:09.1656315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1657736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff0d30>} 2025-05-07T20:32:09.1659180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1660223Z context = 2025-05-07T20:32:09.1660525Z 2025-05-07T20:32:09.1660694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1661238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1661748Z module_map=module_map) 2025-05-07T20:32:09.1662287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1662647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.1662910Z E ^ 2025-05-07T20:32:09.1663379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1663840Z 2025-05-07T20:32:09.1664263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1664884Z 2025-05-07T20:32:09.1665040Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1665618Z self=, 2025-05-07T20:32:09.1666023Z T=128, 2025-05-07T20:32:09.1666215Z D=5120, 2025-05-07T20:32:09.1666418Z scale_ub=None, 2025-05-07T20:32:09.1666632Z contiguous=False, 2025-05-07T20:32:09.1666861Z compiled=False, 2025-05-07T20:32:09.1667068Z ) 2025-05-07T20:32:09.1667395Z self = 2025-05-07T20:32:09.1667895Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.1668172Z 2025-05-07T20:32:09.1668250Z @given( 2025-05-07T20:32:09.1668483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1668796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1669106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1669440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1669772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1670062Z ) 2025-05-07T20:32:09.1670415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1670858Z def test_silu_mul_quant( 2025-05-07T20:32:09.1671254Z self, 2025-05-07T20:32:09.1671458Z T: int, 2025-05-07T20:32:09.1671653Z D: int, 2025-05-07T20:32:09.1671900Z scale_ub: Optional[float], 2025-05-07T20:32:09.1672205Z contiguous: bool, 2025-05-07T20:32:09.1672443Z compiled: bool, 2025-05-07T20:32:09.1672669Z ) -> None: 2025-05-07T20:32:09.1672893Z torch.manual_seed(2025) 2025-05-07T20:32:09.1673135Z 2025-05-07T20:32:09.1673409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1673751Z 2025-05-07T20:32:09.1673946Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1674241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1674560Z x = x_sign * x_clamp 2025-05-07T20:32:09.1674803Z x0 = x[:, :D] 2025-05-07T20:32:09.1675014Z x1 = x[:, D:] 2025-05-07T20:32:09.1675226Z 2025-05-07T20:32:09.1675416Z if contiguous: 2025-05-07T20:32:09.1675652Z x0 = x0.contiguous() 2025-05-07T20:32:09.1675911Z x1 = x1.contiguous() 2025-05-07T20:32:09.1676162Z 2025-05-07T20:32:09.1676351Z if scale_ub is not None: 2025-05-07T20:32:09.1676636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1676977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1677282Z ) 2025-05-07T20:32:09.1677479Z else: 2025-05-07T20:32:09.1677691Z scale_ub_tensor = None 2025-05-07T20:32:09.1677948Z 2025-05-07T20:32:09.1678179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1678495Z op = silu_mul_quant 2025-05-07T20:32:09.1678748Z if compiled: 2025-05-07T20:32:09.1678994Z op = torch.compile(op) 2025-05-07T20:32:09.1679295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1679571Z 2025-05-07T20:32:09.1679761Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.1679932Z 2025-05-07T20:32:09.1680039Z moe/activation_test.py:117: 2025-05-07T20:32:09.1680333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1680660Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.1681037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1681789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.1682492Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.1683033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1683724Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1684395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1684933Z kernel = self.compile( 2025-05-07T20:32:09.1685488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1686157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1686565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1686793Z 2025-05-07T20:32:09.1687005Z self = 2025-05-07T20:32:09.1688103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1689501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a6cb0>} 2025-05-07T20:32:09.1690953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1692054Z context = 2025-05-07T20:32:09.1692352Z 2025-05-07T20:32:09.1692521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1693056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1693532Z module_map=module_map) 2025-05-07T20:32:09.1693898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1694257Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.1694521Z E ^ 2025-05-07T20:32:09.1694991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1695446Z 2025-05-07T20:32:09.1695876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1696397Z 2025-05-07T20:32:09.1696501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1696922Z self=, 2025-05-07T20:32:09.1697333Z T=128, 2025-05-07T20:32:09.1697517Z D=5120, 2025-05-07T20:32:09.1697711Z scale_ub=1200.0, 2025-05-07T20:32:09.1697945Z contiguous=True, 2025-05-07T20:32:09.1698270Z compiled=False, 2025-05-07T20:32:09.1698477Z ) 2025-05-07T20:32:09.3631115Z self = 2025-05-07T20:32:09.3631666Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:09.3631948Z 2025-05-07T20:32:09.3632029Z @given( 2025-05-07T20:32:09.3632270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3632588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3632894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3633239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3633574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3634081Z ) 2025-05-07T20:32:09.3634434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3634883Z def test_silu_mul_quant( 2025-05-07T20:32:09.3635128Z self, 2025-05-07T20:32:09.3635327Z T: int, 2025-05-07T20:32:09.3635529Z D: int, 2025-05-07T20:32:09.3635751Z scale_ub: Optional[float], 2025-05-07T20:32:09.3636022Z contiguous: bool, 2025-05-07T20:32:09.3636265Z compiled: bool, 2025-05-07T20:32:09.3636493Z ) -> None: 2025-05-07T20:32:09.3636707Z torch.manual_seed(2025) 2025-05-07T20:32:09.3636953Z 2025-05-07T20:32:09.3637234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3637575Z 2025-05-07T20:32:09.3637779Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3638079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3638396Z x = x_sign * x_clamp 2025-05-07T20:32:09.3645343Z x0 = x[:, :D] 2025-05-07T20:32:09.3645609Z x1 = x[:, D:] 2025-05-07T20:32:09.3645836Z 2025-05-07T20:32:09.3646027Z if contiguous: 2025-05-07T20:32:09.3646287Z x0 = x0.contiguous() 2025-05-07T20:32:09.3646575Z x1 = x1.contiguous() 2025-05-07T20:32:09.3646831Z 2025-05-07T20:32:09.3647028Z if scale_ub is not None: 2025-05-07T20:32:09.3647304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3647652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3647961Z ) 2025-05-07T20:32:09.3648163Z else: 2025-05-07T20:32:09.3648382Z scale_ub_tensor = None 2025-05-07T20:32:09.3648630Z 2025-05-07T20:32:09.3648870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3649192Z op = silu_mul_quant 2025-05-07T20:32:09.3649602Z if compiled: 2025-05-07T20:32:09.3649862Z op = torch.compile(op) 2025-05-07T20:32:09.3650164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3650438Z 2025-05-07T20:32:09.3650637Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3650805Z 2025-05-07T20:32:09.3650913Z moe/activation_test.py:117: 2025-05-07T20:32:09.3651206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3651544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3651869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3652597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3653293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3653835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3654534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3655203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3656101Z kernel = self.compile( 2025-05-07T20:32:09.3656657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3657327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3657723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3657953Z 2025-05-07T20:32:09.3658239Z self = 2025-05-07T20:32:09.3659342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3660746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a64d0>} 2025-05-07T20:32:09.3662280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3663322Z context = 2025-05-07T20:32:09.3663625Z 2025-05-07T20:32:09.3663794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3664324Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3664802Z module_map=module_map) 2025-05-07T20:32:09.3665163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3665524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3665787Z E ^ 2025-05-07T20:32:09.3666253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3667138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.3667774Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.3668204Z self=,
2025-05-07T20:32:09.3668598Z T=1,
2025-05-07T20:32:09.3668783Z D=7168,
2025-05-07T20:32:09.3668977Z scale_ub=1200.0,
2025-05-07T20:32:09.3669195Z contiguous=True,
2025-05-07T20:32:09.3669423Z compiled=True,
2025-05-07T20:32:09.3669622Z )
2025-05-07T20:32:09.3698493Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.3698851Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.3699106Z E ^
2025-05-07T20:32:09.3699579Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3700461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.3701088Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.3701521Z self=,
2025-05-07T20:32:09.3701978Z T=1,
2025-05-07T20:32:09.3702162Z D=7168,
2025-05-07T20:32:09.3702460Z scale_ub=1200.0,
2025-05-07T20:32:09.3702792Z contiguous=False,
2025-05-07T20:32:09.3703137Z compiled=True,
2025-05-07T20:32:09.3703424Z )
2025-05-07T20:32:09.5113453Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.5113896Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.5114158Z E ^
2025-05-07T20:32:09.5114634Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.5115530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.5116163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.5116582Z self=,
2025-05-07T20:32:09.5116989Z T=1,
2025-05-07T20:32:09.5117181Z D=7168,
2025-05-07T20:32:09.5117376Z scale_ub=None,
2025-05-07T20:32:09.5117594Z contiguous=False,
2025-05-07T20:32:09.5117825Z compiled=True,
2025-05-07T20:32:09.5118030Z )
2025-05-07T20:32:09.6084713Z self =
2025-05-07T20:32:09.6085321Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:09.6097196Z y_fp8, y_scale = fn()
2025-05-07T20:32:09.6097491Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:09.6097787Z
2025-05-07T20:32:09.6098204Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:09.6098543Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:09.6098838Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:09.6099281Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:09.6099644Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:09.6099963Z
2025-05-07T20:32:09.6100171Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:09.6100371Z 2025-05-07T20:32:09.6100471Z moe/activation_test.py:126: 2025-05-07T20:32:09.6100770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6101107Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.6101440Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.6102298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.6103067Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.6103628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.6104326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.6105028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.6105766Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.6106532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.6107286Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.6108026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.6108677Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.6109298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.6109818Z fn() 2025-05-07T20:32:09.6110339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.6111016Z self.fn.run( 2025-05-07T20:32:09.6111489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.6112055Z kernel = self.compile( 2025-05-07T20:32:09.6112629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.6113297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.6113691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.6113922Z 2025-05-07T20:32:09.6114135Z self = 2025-05-07T20:32:09.6115247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.6116661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faaebbf4a60>}
2025-05-07T20:32:09.6118028Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:09.6119072Z context =
2025-05-07T20:32:09.6119368Z
2025-05-07T20:32:09.6119538Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:09.6120071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:09.6120673Z module_map=module_map)
2025-05-07T20:32:09.6121044Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.6121448Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:09.6121729Z E ^
2025-05-07T20:32:09.6122200Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.6123081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
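Every Hypothesis example in this test fails the same way: the error is raised while Triton lowers the kernel to IR (make_ir / ast_to_ttir), before anything launches, because the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn) is only accepted by Triton on NVIDIA GPUs of compute capability 8.9 or newer, and pre-Ada GPUs such as the A10G (SM 8.6) used by g5 instances expose only fp8e4b15 and fp8e5. A minimal sketch of a capability guard such a test could use to skip cleanly on older GPUs; the helper name supports_fp8e4nv is hypothetical, not an FBGEMM or Triton API:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton accepts fp8e4nv (torch.float8_e4m3fn)
        # only on SM 8.9+ (Ada / Hopper). SM 8.6 (A10G) exposes just
        # fp8e4b15 and fp8e5, which is exactly the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...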
2025-05-07T20:32:09.6123714Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.6124132Z self=,
2025-05-07T20:32:09.6124537Z T=1,
2025-05-07T20:32:09.6124721Z D=5120,
2025-05-07T20:32:09.6124911Z scale_ub=1200.0,
2025-05-07T20:32:09.6125137Z contiguous=False,
2025-05-07T20:32:09.6125365Z compiled=True,
2025-05-07T20:32:09.6125573Z )
2025-05-07T20:32:09.9561382Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.9561749Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.9562045Z E ^
2025-05-07T20:32:09.9562532Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9563425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.9564057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.9564484Z self=,
2025-05-07T20:32:09.9564885Z T=1,
2025-05-07T20:32:09.9565077Z D=5120,
2025-05-07T20:32:09.9565276Z scale_ub=1200.0,
2025-05-07T20:32:09.9565501Z contiguous=False,
2025-05-07T20:32:09.9565731Z compiled=False,
2025-05-07T20:32:09.9565953Z )
2025-05-07T20:32:09.9593194Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:09.9593560Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:09.9593821Z E ^
2025-05-07T20:32:09.9594288Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9595177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.9595809Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:09.9596294Z self=,
2025-05-07T20:32:09.9596699Z T=16384,
2025-05-07T20:32:09.9596900Z D=5120,
2025-05-07T20:32:09.9597098Z scale_ub=1200.0,
2025-05-07T20:32:09.9597324Z contiguous=False,
2025-05-07T20:32:09.9597554Z compiled=True,
2025-05-07T20:32:09.9597764Z )
2025-05-07T20:32:10.0639823Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0640439Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0640891Z E ^
2025-05-07T20:32:10.0641751Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0643323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0644550Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.0645280Z self=,
2025-05-07T20:32:10.0645982Z T=2048,
2025-05-07T20:32:10.0646302Z D=7168,
2025-05-07T20:32:10.0646628Z scale_ub=1200.0,
2025-05-07T20:32:10.0647007Z contiguous=False,
2025-05-07T20:32:10.0647399Z compiled=True,
2025-05-07T20:32:10.0647757Z )
2025-05-07T20:32:10.0680199Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.0680553Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.0680809Z E ^
2025-05-07T20:32:10.0681323Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0682205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.1969175Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.1969668Z self=,
2025-05-07T20:32:10.1970082Z T=1,
2025-05-07T20:32:10.1970273Z D=5120,
2025-05-07T20:32:10.1970470Z scale_ub=None,
2025-05-07T20:32:10.1970683Z contiguous=False,
2025-05-07T20:32:10.1970919Z compiled=False,
2025-05-07T20:32:10.1971127Z )
2025-05-07T20:32:10.1998530Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.1998886Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.1999150Z E ^
2025-05-07T20:32:10.1999620Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2000506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
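The one example above that got past fn() (T=1, D=7168, scale_ub=None, compiled=True) still failed in the eager reference path: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which hits the identical ValueError, so the fused kernel and the reference quantizer share a single root cause rather than two separate bugs. A standalone repro sketch of that root cause, assuming a CUDA build of Triton on a pre-SM-8.9 GPU; the kernel below is a toy, not FBGEMM's _fbgemm_silu_mul_quant or _kernel_quantize_fp8_row:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast aborts compilation during make_ir with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected to raise triton.compiler.errors.CompilationError on an A10G:
    _cast_fp8e4nv[(4,)](x, y, 1024, BLOCK=256)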
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.1987624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.1988324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.1989005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.1989550Z kernel = self.compile( 2025-05-07T20:32:10.1990099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.1990773Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1991176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.1991405Z 2025-05-07T20:32:10.1991622Z self = 2025-05-07T20:32:10.1992737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.1994154Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe85e0>} 2025-05-07T20:32:10.1995544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.1996592Z context = 2025-05-07T20:32:10.1996885Z 2025-05-07T20:32:10.1997135Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.1997670Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1998156Z module_map=module_map) 2025-05-07T20:32:10.1998530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1998886Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1999150Z E ^ 2025-05-07T20:32:10.1999620Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.2000081Z 2025-05-07T20:32:10.2000506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.2001034Z 2025-05-07T20:32:10.2001140Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.2001594Z self=, 2025-05-07T20:32:10.2001997Z T=4096, 2025-05-07T20:32:10.2002184Z D=7168, 2025-05-07T20:32:10.2002374Z scale_ub=1200.0, 2025-05-07T20:32:10.2002606Z contiguous=False, 2025-05-07T20:32:10.2002834Z compiled=False, 2025-05-07T20:32:10.2003036Z ) 2025-05-07T20:32:10.2003358Z self = 2025-05-07T20:32:10.2003860Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.2004136Z 2025-05-07T20:32:10.2004267Z @given( 2025-05-07T20:32:10.2004494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.2004812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.2005118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.2005453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.2005789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.2006073Z ) 2025-05-07T20:32:10.2006437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.2006885Z def test_silu_mul_quant( 2025-05-07T20:32:10.2007134Z self, 2025-05-07T20:32:10.2007371Z T: int, 2025-05-07T20:32:10.2007571Z D: int, 2025-05-07T20:32:10.2014822Z scale_ub: Optional[float], 2025-05-07T20:32:10.2015108Z contiguous: bool, 2025-05-07T20:32:10.2015354Z compiled: bool, 2025-05-07T20:32:10.2015584Z ) -> None: 2025-05-07T20:32:10.2015795Z torch.manual_seed(2025) 2025-05-07T20:32:10.2016044Z 2025-05-07T20:32:10.2016325Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.2016665Z 2025-05-07T20:32:10.2016856Z x_sign = torch.sign(x) 2025-05-07T20:32:10.2017149Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.2017461Z x = x_sign * x_clamp 2025-05-07T20:32:10.2017694Z x0 = x[:, :D] 2025-05-07T20:32:10.2017906Z x1 = x[:, D:] 2025-05-07T20:32:10.2018171Z 2025-05-07T20:32:10.2018354Z if contiguous: 2025-05-07T20:32:10.2018590Z x0 = x0.contiguous() 2025-05-07T20:32:10.2018849Z x1 = x1.contiguous() 2025-05-07T20:32:10.2019083Z 2025-05-07T20:32:10.2019272Z if scale_ub is not None: 2025-05-07T20:32:10.2019542Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.2019875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.2020182Z ) 2025-05-07T20:32:10.2020374Z else: 2025-05-07T20:32:10.2020577Z scale_ub_tensor = None 2025-05-07T20:32:10.2020826Z 2025-05-07T20:32:10.2021060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.2021367Z op = silu_mul_quant 2025-05-07T20:32:10.2021613Z if compiled: 2025-05-07T20:32:10.2021862Z op = torch.compile(op) 2025-05-07T20:32:10.2022185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.2022477Z 2025-05-07T20:32:10.2022778Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.2022946Z 2025-05-07T20:32:10.2023050Z moe/activation_test.py:117: 2025-05-07T20:32:10.2023341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.2023669Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.2023950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.2024641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.2025346Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.2025889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:10.2026577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:10.2027242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:10.2027784Z     kernel = self.compile(
2025-05-07T20:32:10.2028332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:10.2028999Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:10.2029394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.2029837Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:10.2030986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:10.2032382Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaebfe8ca0>}
2025-05-07T20:32:10.2033748Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:10.2034827Z context = <...>

2025-05-07T20:32:10.2035289Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:10.2035814Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:10.2036292Z                            module_map=module_map)
2025-05-07T20:32:10.2036656Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.2037009Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.2037264Z E       ^
2025-05-07T20:32:10.2037727Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2038611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:10.2039237Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.3984905Z self = <...>
2025-05-07T20:32:10.3985595Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True

2025-05-07T20:32:10.3985979Z     @given(
2025-05-07T20:32:10.3986223Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:10.3986927Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:10.3987260Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:10.3987603Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:10.3987959Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:10.3988263Z     )
2025-05-07T20:32:10.3988627Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:10.3989094Z     def test_silu_mul_quant(
2025-05-07T20:32:10.3989351Z         self,
2025-05-07T20:32:10.3989559Z         T: int,
2025-05-07T20:32:10.3989768Z         D: int,
2025-05-07T20:32:10.3990001Z         scale_ub: Optional[float],
2025-05-07T20:32:10.3990280Z         contiguous: bool,
2025-05-07T20:32:10.3990536Z         compiled: bool,
2025-05-07T20:32:10.3990782Z     ) -> None:
2025-05-07T20:32:10.3991010Z         torch.manual_seed(2025)
2025-05-07T20:32:10.3991271Z
2025-05-07T20:32:10.3991634Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.3992009Z
2025-05-07T20:32:10.3992224Z         x_sign = torch.sign(x)
2025-05-07T20:32:10.3992526Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3992857Z         x = x_sign * x_clamp
2025-05-07T20:32:10.3993118Z         x0 = x[:, :D]
2025-05-07T20:32:10.3993350Z         x1 = x[:, D:]
2025-05-07T20:32:10.3993566Z
2025-05-07T20:32:10.3993772Z         if contiguous:
2025-05-07T20:32:10.3994028Z             x0 = x0.contiguous()
2025-05-07T20:32:10.3994290Z             x1 = x1.contiguous()
2025-05-07T20:32:10.3994652Z
2025-05-07T20:32:10.3994864Z         if scale_ub is not None:
2025-05-07T20:32:10.3995144Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:10.3995498Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:10.3995817Z             )
2025-05-07T20:32:10.3996010Z         else:
2025-05-07T20:32:10.3996228Z             scale_ub_tensor = None
2025-05-07T20:32:10.3996487Z
2025-05-07T20:32:10.3996729Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:10.3997052Z             op = silu_mul_quant
2025-05-07T20:32:10.3997307Z             if compiled:
2025-05-07T20:32:10.3997643Z                 op = torch.compile(op)
2025-05-07T20:32:10.3997954Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.3998236Z
2025-05-07T20:32:10.3998436Z >       y_fp8, y_scale = fn()

2025-05-07T20:32:10.3998708Z moe/activation_test.py:117:
2025-05-07T20:32:10.3999012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:10.3999354Z moe/activation_test.py:115: in fn
2025-05-07T20:32:10.3999643Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:10.4000227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:10.4000805Z     return fn(*args, **kwargs)
2025-05-07T20:32:10.4001487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:10.4002246Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:10.4002801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:10.4003499Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:10.4004178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:10.4004729Z     kernel = self.compile(
2025-05-07T20:32:10.4005297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:10.4005977Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:10.4013764Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:10.4014132Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:10.4014396Z E       ^
2025-05-07T20:32:10.4014881Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.4015822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
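Every example Hypothesis tries dies at the same spot: the _fbgemm_silu_mul_quant Triton kernel quantizes its output to fp8e4nv (Triton's name for float8_e4m3fn), and Triton can only lower that dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G behind a linux.g5.4xlarge runner is SM 8.6, so compilation fails before any numerics run; the values of T, D, scale_ub, contiguous, and compiled never matter. A capability guard would let the job skip cleanly instead of grinding through doomed examples. A minimal sketch, assuming only standard PyTorch CUDA APIs; the helper, its placement, and the class name are illustrative, not the FBGEMM test suite's actual code:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv == float8_e4m3fn; Triton lowers it only on NVIDIA GPUs with
    # compute capability >= 8.9 (Ada/Hopper). The A10G here reports (8, 6).
    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)
    )


@unittest.skipUnless(
    _supports_fp8e4nv(),
    "fp8e4nv kernels require an SM 8.9+ GPU (e.g. L4, L40S, H100)",
)
class ActivationTests(unittest.TestCase):  # illustrative class name
    ...

Gating at the class level keeps Hypothesis from even generating examples on unsupported hardware, which is why the skip is cheaper than catching the CompilationError per example.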
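The FBGEMM wrapper is incidental to the failure: under the same assumption (a CUDA device below SM 8.9), any Triton kernel that casts to tl.float8e4nv should raise the identical ValueError when it is first compiled. A self-contained repro sketch; the kernel name, sizes, and launch shape are made up for illustration:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # On SM < 8.9 this cast is what triggers
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
grid = (triton.cdiv(x.numel(), 256),)
_cast_fp8e4nv_kernel[grid](x, y, x.numel(), BLOCK=256)  # CompilationError on pre-SM 8.9 GPUs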
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.4015398Z 2025-05-07T20:32:10.4015822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.4016341Z 2025-05-07T20:32:10.4016459Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.4016880Z self=, 2025-05-07T20:32:10.4017295Z T=4096, 2025-05-07T20:32:10.4017492Z D=5120, 2025-05-07T20:32:10.4017692Z scale_ub=None, 2025-05-07T20:32:10.4017920Z contiguous=False, 2025-05-07T20:32:10.4018287Z compiled=True, 2025-05-07T20:32:10.4018503Z ) 2025-05-07T20:32:10.4018883Z self = 2025-05-07T20:32:10.4019388Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:10.4019663Z 2025-05-07T20:32:10.4019745Z @given( 2025-05-07T20:32:10.4019978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.4020303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.4020627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.4020965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.4021322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.4021652Z ) 2025-05-07T20:32:10.4022013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.4022457Z def test_silu_mul_quant( 2025-05-07T20:32:10.4022705Z self, 2025-05-07T20:32:10.4022907Z T: int, 2025-05-07T20:32:10.4023103Z D: int, 2025-05-07T20:32:10.4023348Z scale_ub: Optional[float], 2025-05-07T20:32:10.4023625Z contiguous: bool, 2025-05-07T20:32:10.4023880Z compiled: bool, 2025-05-07T20:32:10.4024114Z ) -> None: 2025-05-07T20:32:10.4024341Z torch.manual_seed(2025) 2025-05-07T20:32:10.4024584Z 2025-05-07T20:32:10.4024866Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.4025225Z 2025-05-07T20:32:10.4025421Z x_sign = torch.sign(x) 2025-05-07T20:32:10.4025721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.4026037Z x = x_sign * x_clamp 2025-05-07T20:32:10.4026280Z x0 = x[:, :D] 2025-05-07T20:32:10.4026508Z x1 = x[:, D:] 2025-05-07T20:32:10.4026723Z 2025-05-07T20:32:10.4026911Z if contiguous: 2025-05-07T20:32:10.4027156Z x0 = x0.contiguous() 2025-05-07T20:32:10.4027508Z x1 = x1.contiguous() 2025-05-07T20:32:10.4027751Z 2025-05-07T20:32:10.4027953Z if scale_ub is not None: 2025-05-07T20:32:10.4028242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.4028579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.4028891Z ) 2025-05-07T20:32:10.4029092Z else: 2025-05-07T20:32:10.4029303Z scale_ub_tensor = None 2025-05-07T20:32:10.4029566Z 2025-05-07T20:32:10.4029815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.4030136Z op = silu_mul_quant 2025-05-07T20:32:10.4030387Z if compiled: 2025-05-07T20:32:10.4030642Z op = torch.compile(op) 2025-05-07T20:32:10.4030950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.4031225Z 2025-05-07T20:32:10.4031435Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.4031615Z 2025-05-07T20:32:10.4031751Z moe/activation_test.py:117: 2025-05-07T20:32:10.4032072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.4032412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.4032709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.4033288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.4033856Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.4034532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.4035297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.4035843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.4036542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.4037228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.4037781Z kernel = self.compile( 2025-05-07T20:32:10.4038331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.4039053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.4039458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.4039687Z 2025-05-07T20:32:10.4039903Z self = 2025-05-07T20:32:10.4041014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.4042425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe9240>} 2025-05-07T20:32:10.4043810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.4044864Z context = 2025-05-07T20:32:10.4045159Z 2025-05-07T20:32:10.4045333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.4045874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.4046366Z module_map=module_map) 2025-05-07T20:32:10.4046743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.4047103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.4047375Z E ^ 2025-05-07T20:32:10.4047964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.4048425Z 2025-05-07T20:32:10.4048854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.4049387Z 2025-05-07T20:32:10.7474870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7475852Z self=, 2025-05-07T20:32:10.7476668Z T=4096, 2025-05-07T20:32:10.7477051Z D=5120, 2025-05-07T20:32:10.7477456Z scale_ub=1200.0, 2025-05-07T20:32:10.7477910Z contiguous=False, 2025-05-07T20:32:10.7478370Z compiled=False, 2025-05-07T20:32:10.7478797Z ) 2025-05-07T20:32:10.7479433Z self = 2025-05-07T20:32:10.7480446Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.7481014Z 2025-05-07T20:32:10.7481171Z @given( 2025-05-07T20:32:10.7481651Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7482150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7482502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7482850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7483185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7483480Z ) 2025-05-07T20:32:10.7483843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7484293Z def test_silu_mul_quant( 2025-05-07T20:32:10.7484762Z self, 2025-05-07T20:32:10.7484968Z T: int, 2025-05-07T20:32:10.7485166Z D: int, 2025-05-07T20:32:10.7485395Z scale_ub: Optional[float], 2025-05-07T20:32:10.7485677Z contiguous: bool, 2025-05-07T20:32:10.7485928Z compiled: bool, 2025-05-07T20:32:10.7486158Z ) -> None: 2025-05-07T20:32:10.7486383Z torch.manual_seed(2025) 2025-05-07T20:32:10.7486633Z 2025-05-07T20:32:10.7486914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7487268Z 2025-05-07T20:32:10.7487471Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7487853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7488177Z x = x_sign * x_clamp 2025-05-07T20:32:10.7488423Z x0 = x[:, :D] 2025-05-07T20:32:10.7488640Z x1 = x[:, D:] 2025-05-07T20:32:10.7488854Z 2025-05-07T20:32:10.7489048Z if contiguous: 2025-05-07T20:32:10.7489286Z x0 = x0.contiguous() 2025-05-07T20:32:10.7489556Z x1 = x1.contiguous() 2025-05-07T20:32:10.7489801Z 2025-05-07T20:32:10.7489997Z if scale_ub is not None: 2025-05-07T20:32:10.7490275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7490619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7490939Z ) 2025-05-07T20:32:10.7491133Z else: 2025-05-07T20:32:10.7491350Z scale_ub_tensor = None 2025-05-07T20:32:10.7491617Z 2025-05-07T20:32:10.7491855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7492176Z op = silu_mul_quant 2025-05-07T20:32:10.7492435Z if compiled: 2025-05-07T20:32:10.7492683Z op = torch.compile(op) 2025-05-07T20:32:10.7492994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7493271Z 2025-05-07T20:32:10.7493465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7493639Z 2025-05-07T20:32:10.7493745Z moe/activation_test.py:117: 2025-05-07T20:32:10.7494050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7494378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7494675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7495381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.7496230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7496782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7497487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7498249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7498793Z kernel = self.compile( 2025-05-07T20:32:10.7499347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7500025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7500429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7500655Z 2025-05-07T20:32:10.7500871Z self = 2025-05-07T20:32:10.7502016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7503453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfeacb0>} 2025-05-07T20:32:10.7504827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7505932Z context = 2025-05-07T20:32:10.7506228Z 2025-05-07T20:32:10.7506400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7506935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7507448Z module_map=module_map) 2025-05-07T20:32:10.7507825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7508183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7508494Z E ^ 2025-05-07T20:32:10.7508968Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7509425Z 2025-05-07T20:32:10.7509854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7510375Z 2025-05-07T20:32:10.7510484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7510909Z self=, 2025-05-07T20:32:10.7511318Z T=4096, 2025-05-07T20:32:10.7511517Z D=5120, 2025-05-07T20:32:10.7511712Z scale_ub=1200.0, 2025-05-07T20:32:10.7511953Z contiguous=False, 2025-05-07T20:32:10.7512212Z compiled=True, 2025-05-07T20:32:10.7512445Z ) 2025-05-07T20:32:10.7512776Z self = 2025-05-07T20:32:10.7513282Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:10.7513562Z 2025-05-07T20:32:10.7513640Z @given( 2025-05-07T20:32:10.7513876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7514197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7514506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7514848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7515188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7515479Z ) 2025-05-07T20:32:10.7515847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7516296Z def test_silu_mul_quant( 2025-05-07T20:32:10.7516549Z self, 2025-05-07T20:32:10.7516758Z T: int, 2025-05-07T20:32:10.7516959Z D: int, 2025-05-07T20:32:10.7517277Z scale_ub: Optional[float], 2025-05-07T20:32:10.7517563Z contiguous: bool, 2025-05-07T20:32:10.7517809Z compiled: bool, 2025-05-07T20:32:10.7528634Z ) -> None: 2025-05-07T20:32:10.7528882Z torch.manual_seed(2025) 2025-05-07T20:32:10.7529139Z 2025-05-07T20:32:10.7529418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7529756Z 2025-05-07T20:32:10.7529957Z x_sign = torch.sign(x) 2025-05-07T20:32:10.7530260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.7530574Z x = x_sign * x_clamp 2025-05-07T20:32:10.7530824Z x0 = x[:, :D] 2025-05-07T20:32:10.7531047Z x1 = x[:, D:] 2025-05-07T20:32:10.7531254Z 2025-05-07T20:32:10.7531450Z if contiguous: 2025-05-07T20:32:10.7531717Z x0 = x0.contiguous() 2025-05-07T20:32:10.7532013Z x1 = x1.contiguous() 2025-05-07T20:32:10.7532254Z 2025-05-07T20:32:10.7532461Z if scale_ub is not None: 2025-05-07T20:32:10.7532746Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.7533085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.7533405Z ) 2025-05-07T20:32:10.7533608Z else: 2025-05-07T20:32:10.7533825Z scale_ub_tensor = None 2025-05-07T20:32:10.7534089Z 2025-05-07T20:32:10.7534334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.7534654Z op = silu_mul_quant 2025-05-07T20:32:10.7535000Z if compiled: 2025-05-07T20:32:10.7535257Z op = torch.compile(op) 2025-05-07T20:32:10.7535557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7535840Z 2025-05-07T20:32:10.7536046Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.7536215Z 2025-05-07T20:32:10.7536319Z moe/activation_test.py:117: 2025-05-07T20:32:10.7536628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7536974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.7537268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.7537841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.7538570Z return fn(*args, **kwargs) 
2025-05-07T20:32:10.7539247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.7539941Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.7540493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.7541186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.7541894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.7542453Z kernel = self.compile( 2025-05-07T20:32:10.7543016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.7543697Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.7544110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.7544341Z 2025-05-07T20:32:10.7544562Z self = 2025-05-07T20:32:10.7545672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.7547089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfeab90>} 2025-05-07T20:32:10.7548547Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.7549597Z context = 2025-05-07T20:32:10.7549900Z 2025-05-07T20:32:10.7550072Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.7550608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.7551097Z module_map=module_map) 2025-05-07T20:32:10.7551468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.7551834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.7552103Z E ^ 2025-05-07T20:32:10.7552576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.7553042Z 2025-05-07T20:32:10.7553470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7553993Z 2025-05-07T20:32:10.8823837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8824294Z self=, 2025-05-07T20:32:10.8824722Z T=2048, 2025-05-07T20:32:10.8824920Z D=7168, 2025-05-07T20:32:10.8825124Z scale_ub=1200.0, 2025-05-07T20:32:10.8825405Z contiguous=False, 2025-05-07T20:32:10.8825633Z compiled=False, 2025-05-07T20:32:10.8825966Z ) 2025-05-07T20:32:10.8826298Z self = 2025-05-07T20:32:10.8826811Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.8827103Z 2025-05-07T20:32:10.8827184Z @given( 2025-05-07T20:32:10.8827425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.8827741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.8828067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.8828411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.8828753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.8829123Z ) 2025-05-07T20:32:10.8829487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.8829942Z def test_silu_mul_quant( 2025-05-07T20:32:10.8830186Z self, 2025-05-07T20:32:10.8830381Z T: int, 2025-05-07T20:32:10.8830583Z D: int, 2025-05-07T20:32:10.8830804Z scale_ub: Optional[float], 2025-05-07T20:32:10.8831082Z contiguous: bool, 2025-05-07T20:32:10.8831325Z compiled: bool, 2025-05-07T20:32:10.8831553Z ) -> None: 2025-05-07T20:32:10.8831781Z torch.manual_seed(2025) 2025-05-07T20:32:10.8832032Z 2025-05-07T20:32:10.8832318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.8832676Z 2025-05-07T20:32:10.8832890Z x_sign = torch.sign(x) 2025-05-07T20:32:10.8833195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.8833511Z x = x_sign * x_clamp 2025-05-07T20:32:10.8833765Z x0 = x[:, :D] 2025-05-07T20:32:10.8833988Z x1 = x[:, D:] 2025-05-07T20:32:10.8834199Z 2025-05-07T20:32:10.8834394Z if contiguous: 2025-05-07T20:32:10.8834638Z x0 = x0.contiguous() 2025-05-07T20:32:10.8834901Z x1 = x1.contiguous() 2025-05-07T20:32:10.8835154Z 2025-05-07T20:32:10.8835357Z if scale_ub is not None: 2025-05-07T20:32:10.8835632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.8835983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.8836303Z ) 2025-05-07T20:32:10.8836499Z else: 2025-05-07T20:32:10.8836729Z scale_ub_tensor = None 2025-05-07T20:32:10.8836993Z 2025-05-07T20:32:10.8837232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.8837675Z op = silu_mul_quant 2025-05-07T20:32:10.8837936Z if compiled: 2025-05-07T20:32:10.8838189Z op = torch.compile(op) 2025-05-07T20:32:10.8838500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8838781Z 2025-05-07T20:32:10.8838985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.8839157Z 2025-05-07T20:32:10.8839260Z moe/activation_test.py:117: 2025-05-07T20:32:10.8839563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8839901Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.8840188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8840903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.8841614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.8842177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.8842872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.8843557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.8844109Z kernel = self.compile( 2025-05-07T20:32:10.8844665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.8845344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.8845822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8846052Z 2025-05-07T20:32:10.8846273Z self = 2025-05-07T20:32:10.8847384Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.8848801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c05e0>} 2025-05-07T20:32:10.8850227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.8851288Z context = 2025-05-07T20:32:10.8851584Z 2025-05-07T20:32:10.8851762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.8852292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.8852778Z module_map=module_map) 2025-05-07T20:32:10.8853159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.8853517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.8853790Z E ^ 2025-05-07T20:32:10.8854277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.8854739Z 2025-05-07T20:32:10.8855174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.8856098Z 2025-05-07T20:32:10.8856215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8856659Z self=, 2025-05-07T20:32:10.8857077Z T=1, 2025-05-07T20:32:10.8857268Z D=7168, 2025-05-07T20:32:10.8857477Z scale_ub=None, 2025-05-07T20:32:10.8857708Z contiguous=True, 2025-05-07T20:32:10.8857936Z compiled=False, 2025-05-07T20:32:10.8858223Z ) 2025-05-07T20:32:10.8858554Z self = 2025-05-07T20:32:10.8859201Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:10.8859474Z 2025-05-07T20:32:10.8859554Z @given( 2025-05-07T20:32:10.8859801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.8860126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.8860441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.8860786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.8861128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.8861421Z ) 2025-05-07T20:32:10.8861789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.8862246Z def test_silu_mul_quant( 2025-05-07T20:32:10.8862493Z self, 2025-05-07T20:32:10.8862703Z T: int, 2025-05-07T20:32:10.8862913Z D: int, 2025-05-07T20:32:10.8863142Z scale_ub: Optional[float], 2025-05-07T20:32:10.8863421Z contiguous: bool, 2025-05-07T20:32:10.8863676Z compiled: bool, 2025-05-07T20:32:10.8863914Z ) -> None: 2025-05-07T20:32:10.8864136Z torch.manual_seed(2025) 2025-05-07T20:32:10.8864389Z 2025-05-07T20:32:10.8864678Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.8865029Z 2025-05-07T20:32:10.8865236Z x_sign = torch.sign(x) 2025-05-07T20:32:10.8865536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.8865849Z x = x_sign * x_clamp 2025-05-07T20:32:10.8866164Z x0 = x[:, :D] 2025-05-07T20:32:10.8866391Z x1 = x[:, D:] 2025-05-07T20:32:10.8866603Z 2025-05-07T20:32:10.8866799Z if contiguous: 2025-05-07T20:32:10.8867043Z x0 = x0.contiguous() 2025-05-07T20:32:10.8867309Z x1 = x1.contiguous() 2025-05-07T20:32:10.8867561Z 2025-05-07T20:32:10.8867764Z if scale_ub is not None: 2025-05-07T20:32:10.8868053Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.8868407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.8868726Z ) 2025-05-07T20:32:10.8868930Z else: 2025-05-07T20:32:10.8869227Z scale_ub_tensor = None 2025-05-07T20:32:10.8869492Z 2025-05-07T20:32:10.8869740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.8870065Z op = silu_mul_quant 2025-05-07T20:32:10.8870328Z if compiled: 2025-05-07T20:32:10.8870585Z op = torch.compile(op) 2025-05-07T20:32:10.8870899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8871188Z 2025-05-07T20:32:10.8871388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.8871569Z 2025-05-07T20:32:10.8871687Z moe/activation_test.py:117: 2025-05-07T20:32:10.8872031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8872366Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.8872662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.8873380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.8874093Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.8874645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.8875348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.8876036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.8876583Z kernel = self.compile( 2025-05-07T20:32:10.8877147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.8877825Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.8878234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.8878617Z 2025-05-07T20:32:10.8878838Z self = 2025-05-07T20:32:10.8879956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.8881371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c0d30>} 2025-05-07T20:32:10.8882759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.8883820Z context = 2025-05-07T20:32:10.8884119Z 2025-05-07T20:32:10.8884296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.8884842Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.8885331Z module_map=module_map) 2025-05-07T20:32:10.8885703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.8886074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.8886343Z E ^ 2025-05-07T20:32:10.8886819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.8887333Z 2025-05-07T20:32:10.8887764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.8888296Z 2025-05-07T20:32:10.8888405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.8888835Z self=, 2025-05-07T20:32:10.8889245Z T=16384, 2025-05-07T20:32:10.8889455Z D=7168, 2025-05-07T20:32:10.8889659Z scale_ub=1200.0, 2025-05-07T20:32:10.8889890Z contiguous=False, 2025-05-07T20:32:10.8890176Z compiled=True, 2025-05-07T20:32:11.1538371Z ) 2025-05-07T20:32:11.1539477Z self = 2025-05-07T20:32:11.1540925Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.1541526Z 2025-05-07T20:32:11.1541610Z @given( 2025-05-07T20:32:11.1541866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.1542193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.1542508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.1542854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.1543197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.1543486Z ) 2025-05-07T20:32:11.1543856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.1544312Z def test_silu_mul_quant( 2025-05-07T20:32:11.1544564Z self, 2025-05-07T20:32:11.1544761Z T: int, 2025-05-07T20:32:11.1544970Z D: int, 2025-05-07T20:32:11.1545201Z scale_ub: Optional[float], 2025-05-07T20:32:11.1545474Z contiguous: bool, 2025-05-07T20:32:11.1545720Z compiled: bool, 2025-05-07T20:32:11.1545953Z ) -> None: 2025-05-07T20:32:11.1546176Z torch.manual_seed(2025) 2025-05-07T20:32:11.1546427Z 2025-05-07T20:32:11.1546716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.1547063Z 2025-05-07T20:32:11.1547272Z x_sign = torch.sign(x) 2025-05-07T20:32:11.1547573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.1547884Z x = x_sign * x_clamp 2025-05-07T20:32:11.1548132Z x0 = x[:, :D] 2025-05-07T20:32:11.1548354Z x1 = x[:, D:] 2025-05-07T20:32:11.1548563Z 2025-05-07T20:32:11.1548957Z if contiguous: 2025-05-07T20:32:11.1549204Z x0 = x0.contiguous() 2025-05-07T20:32:11.1549466Z x1 = x1.contiguous() 2025-05-07T20:32:11.1549718Z 2025-05-07T20:32:11.1549918Z if scale_ub is not None: 2025-05-07T20:32:11.1550203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.1550543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.1550856Z ) 2025-05-07T20:32:11.1551055Z else: 2025-05-07T20:32:11.1551268Z scale_ub_tensor = None 2025-05-07T20:32:11.1551526Z 2025-05-07T20:32:11.1551774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.1552090Z op = silu_mul_quant 2025-05-07T20:32:11.1552347Z if compiled: 2025-05-07T20:32:11.1552605Z op = torch.compile(op) 2025-05-07T20:32:11.1552911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1553196Z 2025-05-07T20:32:11.1553407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.1553579Z 2025-05-07T20:32:11.1553687Z moe/activation_test.py:117: 2025-05-07T20:32:11.1553992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1554330Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.1554623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1555200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.1555970Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.1556722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.1557423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.1557975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.1558673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.1559358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.1559965Z kernel = self.compile( 2025-05-07T20:32:11.1560523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.1561199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.1561599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1561838Z 2025-05-07T20:32:11.1562053Z self = 2025-05-07T20:32:11.1563163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.1564585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c1bd0>} 2025-05-07T20:32:11.1565962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.1567012Z context = 2025-05-07T20:32:11.1567311Z 2025-05-07T20:32:11.1567487Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.1568023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.1568507Z module_map=module_map) 2025-05-07T20:32:11.1568878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.1569241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.1569507Z E ^ 2025-05-07T20:32:11.1570094Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.1570562Z 2025-05-07T20:32:11.1570989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.1571518Z 2025-05-07T20:32:11.1571626Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.1572051Z self=, 2025-05-07T20:32:11.1572460Z T=1, 2025-05-07T20:32:11.1572654Z D=7168, 2025-05-07T20:32:11.1572855Z scale_ub=None, 2025-05-07T20:32:11.1573075Z contiguous=False, 2025-05-07T20:32:11.1573308Z compiled=False, 2025-05-07T20:32:11.1573522Z ) 2025-05-07T20:32:11.1573846Z self = 2025-05-07T20:32:11.1574346Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.1574618Z 2025-05-07T20:32:11.1574706Z @given( 2025-05-07T20:32:11.1574947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.1575263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.1575585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.1575925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.1576257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.1576550Z ) 2025-05-07T20:32:11.1576910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.1577403Z def test_silu_mul_quant( 2025-05-07T20:32:11.1577651Z self, 2025-05-07T20:32:11.1577856Z T: int, 2025-05-07T20:32:11.1578120Z D: int, 2025-05-07T20:32:11.1578353Z scale_ub: Optional[float], 2025-05-07T20:32:11.1578633Z contiguous: bool, 2025-05-07T20:32:11.1578886Z compiled: bool, 2025-05-07T20:32:11.1579131Z ) -> None: 2025-05-07T20:32:11.1579366Z torch.manual_seed(2025) 2025-05-07T20:32:11.1579615Z 2025-05-07T20:32:11.1579900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.1580297Z 2025-05-07T20:32:11.1580499Z x_sign = torch.sign(x) 2025-05-07T20:32:11.1580801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.1581115Z x = x_sign * x_clamp 2025-05-07T20:32:11.1581362Z x0 = x[:, :D] 2025-05-07T20:32:11.1581586Z x1 = x[:, D:] 2025-05-07T20:32:11.1581793Z 2025-05-07T20:32:11.1581989Z if contiguous: 2025-05-07T20:32:11.1582230Z x0 = x0.contiguous() 2025-05-07T20:32:11.1582488Z x1 = x1.contiguous() 2025-05-07T20:32:11.1582740Z 2025-05-07T20:32:11.1582940Z if scale_ub is not None: 2025-05-07T20:32:11.1583214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.1583559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.1583874Z ) 2025-05-07T20:32:11.1584071Z else: 2025-05-07T20:32:11.1584290Z scale_ub_tensor = None 2025-05-07T20:32:11.1584548Z 2025-05-07T20:32:11.1584789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.1585108Z op = silu_mul_quant 2025-05-07T20:32:11.1585363Z if compiled: 2025-05-07T20:32:11.1585620Z op = torch.compile(op) 2025-05-07T20:32:11.1585919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1586199Z 2025-05-07T20:32:11.1586400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.1586571Z 2025-05-07T20:32:11.1586673Z moe/activation_test.py:117: 2025-05-07T20:32:11.1586975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1587308Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.1587594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.1588383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.1589088Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.1589634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.1590326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.1591003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.1591546Z kernel = self.compile( 2025-05-07T20:32:11.1592103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.1592766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.1600632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.1600902Z 2025-05-07T20:32:11.1601133Z self = 2025-05-07T20:32:11.1602297Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.1603701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c2050>} 2025-05-07T20:32:11.1605071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.1606189Z context = 2025-05-07T20:32:11.1606482Z 2025-05-07T20:32:11.1606656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.1607187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.1607662Z module_map=module_map) 2025-05-07T20:32:11.1608033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.1608436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.1608704Z E ^ 2025-05-07T20:32:11.1609178Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.1609634Z 2025-05-07T20:32:11.1610065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.1610586Z 2025-05-07T20:32:11.1610692Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.1611114Z self=, 2025-05-07T20:32:11.1611515Z T=2048, 2025-05-07T20:32:11.1611703Z D=7168, 2025-05-07T20:32:11.1611899Z scale_ub=None, 2025-05-07T20:32:11.1612129Z contiguous=False, 2025-05-07T20:32:11.1612397Z compiled=True, 2025-05-07T20:32:11.1612610Z ) 2025-05-07T20:32:11.2607085Z self = 2025-05-07T20:32:11.2608613Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.2609390Z 2025-05-07T20:32:11.2609608Z @given( 2025-05-07T20:32:11.2610266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.2611114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.2611712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.2612052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.2612384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.2612676Z ) 2025-05-07T20:32:11.2613034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.2613484Z def test_silu_mul_quant( 2025-05-07T20:32:11.2613729Z self, 2025-05-07T20:32:11.2614098Z T: int, 2025-05-07T20:32:11.2614306Z D: int, 2025-05-07T20:32:11.2614524Z scale_ub: Optional[float], 2025-05-07T20:32:11.2614808Z contiguous: bool, 2025-05-07T20:32:11.2615054Z compiled: bool, 2025-05-07T20:32:11.2615280Z ) -> None: 2025-05-07T20:32:11.2615507Z torch.manual_seed(2025) 2025-05-07T20:32:11.2615754Z 2025-05-07T20:32:11.2616035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.2616386Z 2025-05-07T20:32:11.2616589Z x_sign = torch.sign(x) 2025-05-07T20:32:11.2616884Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.2617201Z x = x_sign * x_clamp 2025-05-07T20:32:11.2617448Z x0 = x[:, :D] 2025-05-07T20:32:11.2617666Z x1 = x[:, D:] 2025-05-07T20:32:11.2617876Z 2025-05-07T20:32:11.2618142Z if contiguous: 2025-05-07T20:32:11.2618377Z x0 = x0.contiguous() 2025-05-07T20:32:11.2618646Z x1 = x1.contiguous() 2025-05-07T20:32:11.2618887Z 2025-05-07T20:32:11.2619083Z if scale_ub is not None: 2025-05-07T20:32:11.2619360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.2619698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.2620009Z ) 2025-05-07T20:32:11.2620204Z else: 2025-05-07T20:32:11.2620412Z scale_ub_tensor = None 2025-05-07T20:32:11.2620669Z 2025-05-07T20:32:11.2620911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.2621293Z op = silu_mul_quant 2025-05-07T20:32:11.2621549Z if compiled: 2025-05-07T20:32:11.2621804Z op = torch.compile(op) 2025-05-07T20:32:11.2622101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2622382Z 2025-05-07T20:32:11.2622582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.2622746Z 2025-05-07T20:32:11.2622846Z moe/activation_test.py:117: 2025-05-07T20:32:11.2623144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2623477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.2623771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2624409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.2624977Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.2625642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.2626338Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.2626880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.2627566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.2628239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.2628775Z kernel = self.compile( 2025-05-07T20:32:11.2629333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.2630002Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2630392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2630622Z 2025-05-07T20:32:11.2630835Z self = 2025-05-07T20:32:11.2631938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.2633334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c31c0>} 2025-05-07T20:32:11.2634777Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.2635818Z context = 2025-05-07T20:32:11.2636113Z 2025-05-07T20:32:11.2636283Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.2636811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2637287Z module_map=module_map) 2025-05-07T20:32:11.2637651Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2638013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2638273Z E ^ 2025-05-07T20:32:11.2638739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2639204Z 2025-05-07T20:32:11.2639625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.2640150Z 2025-05-07T20:32:11.2640255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.2640675Z self=, 2025-05-07T20:32:11.2641074Z T=4096, 2025-05-07T20:32:11.2641268Z D=7168, 2025-05-07T20:32:11.2641464Z scale_ub=None, 2025-05-07T20:32:11.2641725Z contiguous=False, 2025-05-07T20:32:11.2641953Z compiled=True, 2025-05-07T20:32:11.2642158Z ) 2025-05-07T20:32:11.2642480Z self = 2025-05-07T20:32:11.2642978Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.2643256Z 2025-05-07T20:32:11.2643332Z @given( 2025-05-07T20:32:11.2643562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.2643882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.2644189Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.2644523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.2644900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.2645188Z ) 2025-05-07T20:32:11.2645543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.2645983Z def test_silu_mul_quant( 2025-05-07T20:32:11.2646229Z self, 2025-05-07T20:32:11.2646428Z T: int, 2025-05-07T20:32:11.2646624Z D: int, 2025-05-07T20:32:11.2646848Z scale_ub: Optional[float], 2025-05-07T20:32:11.2647122Z contiguous: bool, 2025-05-07T20:32:11.2647368Z compiled: bool, 2025-05-07T20:32:11.2647588Z ) -> None: 2025-05-07T20:32:11.2647808Z torch.manual_seed(2025) 2025-05-07T20:32:11.2648052Z 2025-05-07T20:32:11.2648329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.2648674Z 2025-05-07T20:32:11.2648870Z x_sign = torch.sign(x) 2025-05-07T20:32:11.2649160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.2649477Z x = x_sign * x_clamp 2025-05-07T20:32:11.2649720Z x0 = x[:, :D] 2025-05-07T20:32:11.2649934Z x1 = x[:, D:] 2025-05-07T20:32:11.2650145Z 2025-05-07T20:32:11.2650334Z if contiguous: 2025-05-07T20:32:11.2650564Z x0 = x0.contiguous() 2025-05-07T20:32:11.2650825Z x1 = x1.contiguous() 2025-05-07T20:32:11.2651068Z 2025-05-07T20:32:11.2651256Z if scale_ub is not None: 2025-05-07T20:32:11.2651536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.2651874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.2652181Z ) 2025-05-07T20:32:11.2652371Z else: 2025-05-07T20:32:11.2652585Z scale_ub_tensor = None 2025-05-07T20:32:11.2652840Z 2025-05-07T20:32:11.2653153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.2653470Z op = silu_mul_quant 2025-05-07T20:32:11.2653723Z if compiled: 2025-05-07T20:32:11.2653972Z op = torch.compile(op) 2025-05-07T20:32:11.2654271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2654547Z 2025-05-07T20:32:11.2654739Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.2654911Z 2025-05-07T20:32:11.2655011Z moe/activation_test.py:117: 2025-05-07T20:32:11.2655310Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2655814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.2656099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.2656666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.2657230Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.2657895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.2658649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.2659200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.2659887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.2660559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.2661168Z kernel = self.compile( 2025-05-07T20:32:11.2661720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.2662382Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2662781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.2663006Z 2025-05-07T20:32:11.2663232Z self = 2025-05-07T20:32:11.2664328Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.2665788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc1f0>} 2025-05-07T20:32:11.2667159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.2668201Z context = 2025-05-07T20:32:11.2668493Z 2025-05-07T20:32:11.2668670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.2669202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2669683Z module_map=module_map) 2025-05-07T20:32:11.2670052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2670414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2670669Z E ^ 2025-05-07T20:32:11.2671138Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc700>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
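Every one of these draws fails at the same compilation step, before any test parameter matters: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. The A10G that backs this runner class reports SM 8.6, where Triton offers only fp8e4b15 and fp8e5, hence the ValueError above. Below is a minimal sketch of a capability guard a test suite could use to skip such cases on unsupported GPUs; the helper name, the class name, and the (8, 9) threshold are illustrative assumptions, not FBGEMM code.

```python
import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) natively only on
    # SM 8.9+ (Ada, Hopper). An A10G reports (8, 6) and would be skipped.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class ActivationFP8GuardExample(unittest.TestCase):  # hypothetical test class
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant_guarded(self) -> None:
        ...  # the property-based body above would run here unchanged
```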
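For checking numerics on supported hardware, the following eager-mode sketch shows what the test appears to exercise, inferred only from the test body above: y = SiLU(x0) * x1, quantized to FP8 with an optional upper bound on the scale. The rowwise granularity, the helper name silu_mul_quant_ref, and the 448.0 constant (the finite maximum of float8_e4m3fn) are assumptions, not FBGEMM's actual implementation.

```python
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (Triton's fp8e4nv)


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Sketch: SiLU(x0) * x1 in float32, then rowwise FP8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        # Optional clamp on the rowwise max, mirroring the scale_ub_tensor
        # argument in the test above.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```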
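To confirm the architecture check in isolation, a self-contained kernel like the sketch below should raise the same CompilationError on an SM 8.6 device and compile cleanly on SM 8.9+. It uses only public Triton APIs (tl.float8e4nv, tl.load/tl.store) and is illustrative, not the FBGEMM kernel.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The cast to fp8e4nv is what trips the check on pre-SM89 GPUs.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_fp8_cast_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)
```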
2025-05-07T20:32:11.6285950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6286653Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6287191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6287924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6288601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6289137Z kernel = self.compile( 2025-05-07T20:32:11.6289684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6290351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6290749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6290976Z 2025-05-07T20:32:11.6291193Z self = 2025-05-07T20:32:11.6292287Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6293687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bd7e0>} 2025-05-07T20:32:11.6295054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6296100Z context = 2025-05-07T20:32:11.6296392Z 2025-05-07T20:32:11.6296561Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6297084Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6297639Z module_map=module_map) 2025-05-07T20:32:11.6298083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6298443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6298707Z E ^ 2025-05-07T20:32:11.6299182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6299633Z 2025-05-07T20:32:11.6300060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6300580Z 2025-05-07T20:32:11.8197526Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.8198134Z self=, 2025-05-07T20:32:11.8198758Z T=16384, 2025-05-07T20:32:11.8199029Z D=5120, 2025-05-07T20:32:11.8199303Z scale_ub=None, 2025-05-07T20:32:11.8199618Z contiguous=False, 2025-05-07T20:32:11.8199851Z compiled=True, 2025-05-07T20:32:11.8200067Z ) 2025-05-07T20:32:11.8200395Z self = 2025-05-07T20:32:11.8200901Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.8201189Z 2025-05-07T20:32:11.8201269Z @given( 2025-05-07T20:32:11.8201506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.8201819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.8202132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.8202594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.8202937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.8203233Z ) 2025-05-07T20:32:11.8203591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.8210370Z def test_silu_mul_quant( 2025-05-07T20:32:11.8210638Z self, 2025-05-07T20:32:11.8210843Z T: int, 2025-05-07T20:32:11.8211052Z D: int, 2025-05-07T20:32:11.8211270Z scale_ub: Optional[float], 2025-05-07T20:32:11.8211549Z contiguous: bool, 2025-05-07T20:32:11.8211896Z compiled: bool, 2025-05-07T20:32:11.8212123Z ) -> None: 2025-05-07T20:32:11.8212345Z torch.manual_seed(2025) 2025-05-07T20:32:11.8212584Z 2025-05-07T20:32:11.8212862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.8213213Z 2025-05-07T20:32:11.8213413Z x_sign = torch.sign(x) 2025-05-07T20:32:11.8213715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.8214031Z x = x_sign * x_clamp 2025-05-07T20:32:11.8214271Z x0 = x[:, :D] 2025-05-07T20:32:11.8214491Z x1 = x[:, D:] 2025-05-07T20:32:11.8214701Z 2025-05-07T20:32:11.8214883Z if contiguous: 2025-05-07T20:32:11.8215121Z x0 = x0.contiguous() 2025-05-07T20:32:11.8215385Z x1 = x1.contiguous() 2025-05-07T20:32:11.8215631Z 2025-05-07T20:32:11.8215827Z if scale_ub is not None: 2025-05-07T20:32:11.8216106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.8216455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.8216767Z ) 2025-05-07T20:32:11.8216967Z else: 2025-05-07T20:32:11.8217184Z scale_ub_tensor = None 2025-05-07T20:32:11.8217435Z 2025-05-07T20:32:11.8217671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.8218072Z op = silu_mul_quant 2025-05-07T20:32:11.8218326Z if compiled: 2025-05-07T20:32:11.8218582Z op = torch.compile(op) 2025-05-07T20:32:11.8218886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.8219157Z 2025-05-07T20:32:11.8219357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.8219524Z 2025-05-07T20:32:11.8219629Z moe/activation_test.py:117: 2025-05-07T20:32:11.8220054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.8220381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.8220668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.8221246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.8221811Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.8222482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.8223182Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.8223721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.8224403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.8225077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.8225616Z kernel = self.compile( 2025-05-07T20:32:11.8226164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.8226829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.8227228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.8227454Z 2025-05-07T20:32:11.8227672Z self = 2025-05-07T20:32:11.8228763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.8230209Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6be680>} 2025-05-07T20:32:11.8231581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.8232660Z context = 2025-05-07T20:32:11.8232951Z 2025-05-07T20:32:11.8233122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.8233644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.8234122Z module_map=module_map) 2025-05-07T20:32:11.8234491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.8234848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.8235103Z E ^ 2025-05-07T20:32:11.8235576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.8236030Z 2025-05-07T20:32:11.8236462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.8236982Z 2025-05-07T20:32:11.8237090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.8237506Z self=, 2025-05-07T20:32:11.8237909Z T=2048, 2025-05-07T20:32:11.8238101Z D=5120, 2025-05-07T20:32:11.8238290Z scale_ub=None, 2025-05-07T20:32:11.8238508Z contiguous=False, 2025-05-07T20:32:11.8238740Z compiled=True, 2025-05-07T20:32:11.8238940Z ) 2025-05-07T20:32:11.9275432Z self = 2025-05-07T20:32:11.9276159Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.9276556Z 2025-05-07T20:32:11.9276672Z @given( 2025-05-07T20:32:11.9276993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.9277440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.9278059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.9278491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.9278835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.9279127Z ) 2025-05-07T20:32:11.9279481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.9279932Z def test_silu_mul_quant( 2025-05-07T20:32:11.9280212Z self, 2025-05-07T20:32:11.9280418Z T: int, 2025-05-07T20:32:11.9280624Z D: int, 2025-05-07T20:32:11.9280846Z scale_ub: Optional[float], 2025-05-07T20:32:11.9281125Z contiguous: bool, 2025-05-07T20:32:11.9281372Z compiled: bool, 2025-05-07T20:32:11.9281603Z ) -> None: 2025-05-07T20:32:11.9281828Z torch.manual_seed(2025) 2025-05-07T20:32:11.9282075Z 2025-05-07T20:32:11.9282351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.9282697Z 2025-05-07T20:32:11.9282901Z x_sign = torch.sign(x) 2025-05-07T20:32:11.9283194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.9283511Z x = x_sign * x_clamp 2025-05-07T20:32:11.9283758Z x0 = x[:, :D] 2025-05-07T20:32:11.9283976Z x1 = x[:, D:] 2025-05-07T20:32:11.9284192Z 2025-05-07T20:32:11.9284387Z if contiguous: 2025-05-07T20:32:11.9284622Z x0 = x0.contiguous() 2025-05-07T20:32:11.9284888Z x1 = x1.contiguous() 2025-05-07T20:32:11.9285203Z 2025-05-07T20:32:11.9285405Z if scale_ub is not None: 2025-05-07T20:32:11.9285681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.9286026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.9286341Z ) 2025-05-07T20:32:11.9286534Z else: 2025-05-07T20:32:11.9286749Z scale_ub_tensor = None 2025-05-07T20:32:11.9287005Z 2025-05-07T20:32:11.9287244Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.9287566Z op = silu_mul_quant 2025-05-07T20:32:11.9287822Z if compiled: 2025-05-07T20:32:11.9288069Z op = torch.compile(op) 2025-05-07T20:32:11.9288443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9288723Z 2025-05-07T20:32:11.9288916Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9289089Z 2025-05-07T20:32:11.9289192Z moe/activation_test.py:117: 2025-05-07T20:32:11.9289499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9289840Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9290126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9290694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.9291263Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.9291934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9292694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9293237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9293929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9294598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9295143Z kernel = self.compile( 2025-05-07T20:32:11.9295700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9296367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9296771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9297004Z 2025-05-07T20:32:11.9297216Z self = 2025-05-07T20:32:11.9298501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9299917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6be560>} 2025-05-07T20:32:11.9301279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9302329Z context = 2025-05-07T20:32:11.9302629Z 2025-05-07T20:32:11.9302797Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9303332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9303804Z module_map=module_map) 2025-05-07T20:32:11.9304181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9304543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9304805Z E ^ 2025-05-07T20:32:11.9305279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9305739Z 2025-05-07T20:32:11.9306208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9306727Z 2025-05-07T20:32:11.9306840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.9307258Z self=, 2025-05-07T20:32:11.9307666Z T=2048, 2025-05-07T20:32:11.9307860Z D=5120, 2025-05-07T20:32:11.9308052Z scale_ub=1200.0, 2025-05-07T20:32:11.9308289Z contiguous=False, 2025-05-07T20:32:11.9308522Z compiled=True, 2025-05-07T20:32:11.9308723Z ) 2025-05-07T20:32:11.9309047Z self = 2025-05-07T20:32:11.9309596Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.9309873Z 2025-05-07T20:32:11.9309956Z @given( 2025-05-07T20:32:11.9310185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.9310501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.9310821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.9311153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.9311491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.9311793Z ) 2025-05-07T20:32:11.9312196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.9312648Z def test_silu_mul_quant( 2025-05-07T20:32:11.9312898Z self, 2025-05-07T20:32:11.9313104Z T: int, 2025-05-07T20:32:11.9313301Z D: int, 2025-05-07T20:32:11.9313523Z scale_ub: Optional[float], 2025-05-07T20:32:11.9313806Z contiguous: bool, 2025-05-07T20:32:11.9314046Z compiled: bool, 2025-05-07T20:32:11.9314276Z ) -> None: 2025-05-07T20:32:11.9314498Z torch.manual_seed(2025) 2025-05-07T20:32:11.9314740Z 2025-05-07T20:32:11.9315022Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.9315372Z 2025-05-07T20:32:11.9315565Z x_sign = torch.sign(x) 2025-05-07T20:32:11.9315865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.9316184Z x = x_sign * x_clamp 2025-05-07T20:32:11.9316421Z x0 = x[:, :D] 2025-05-07T20:32:11.9316640Z x1 = x[:, D:] 2025-05-07T20:32:11.9316853Z 2025-05-07T20:32:11.9317041Z if contiguous: 2025-05-07T20:32:11.9317279Z x0 = x0.contiguous() 2025-05-07T20:32:11.9317679Z x1 = x1.contiguous() 2025-05-07T20:32:11.9317924Z 2025-05-07T20:32:11.9318118Z if scale_ub is not None: 2025-05-07T20:32:11.9318398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.9318749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.9319055Z ) 2025-05-07T20:32:11.9319253Z else: 2025-05-07T20:32:11.9319466Z scale_ub_tensor = None 2025-05-07T20:32:11.9319715Z 2025-05-07T20:32:11.9319958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.9320278Z op = silu_mul_quant 2025-05-07T20:32:11.9320533Z if compiled: 2025-05-07T20:32:11.9320782Z op = torch.compile(op) 2025-05-07T20:32:11.9321083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9321364Z 2025-05-07T20:32:11.9321559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.9321729Z 2025-05-07T20:32:11.9321833Z moe/activation_test.py:117: 2025-05-07T20:32:11.9322139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9322465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.9322754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.9323322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.9323888Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.9324551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.9325316Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.9325864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.9326549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.9327226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.9327768Z kernel = self.compile( 2025-05-07T20:32:11.9328318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.9329028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.9329433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.9329660Z 2025-05-07T20:32:11.9329880Z self = 2025-05-07T20:32:11.9330981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.9332371Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bf370>} 2025-05-07T20:32:11.9333735Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.9334783Z context = 2025-05-07T20:32:11.9335076Z 2025-05-07T20:32:11.9335250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.9335776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.9336258Z module_map=module_map) 2025-05-07T20:32:11.9336629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.9336986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.9337249Z E ^ 2025-05-07T20:32:11.9337719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.9338320Z 2025-05-07T20:32:11.9338748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.9339271Z 2025-05-07T20:32:12.1244444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1244963Z self=, 2025-05-07T20:32:12.1245594Z T=4096, 2025-05-07T20:32:12.1245862Z D=5120, 2025-05-07T20:32:12.1246135Z scale_ub=1200.0, 2025-05-07T20:32:12.1246458Z contiguous=True, 2025-05-07T20:32:12.1246719Z compiled=True, 2025-05-07T20:32:12.1246930Z ) 2025-05-07T20:32:12.1247259Z self = 2025-05-07T20:32:12.1247766Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.1248040Z 2025-05-07T20:32:12.1248125Z @given( 2025-05-07T20:32:12.1248355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.1248678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.1248991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.1249332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.1249661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.1249950Z ) 2025-05-07T20:32:12.1250305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.1250752Z def test_silu_mul_quant( 2025-05-07T20:32:12.1251109Z self, 2025-05-07T20:32:12.1251310Z T: int, 2025-05-07T20:32:12.1251507Z D: int, 2025-05-07T20:32:12.1251729Z scale_ub: Optional[float], 2025-05-07T20:32:12.1252010Z contiguous: bool, 2025-05-07T20:32:12.1252250Z compiled: bool, 2025-05-07T20:32:12.1252477Z ) -> None: 2025-05-07T20:32:12.1252730Z torch.manual_seed(2025) 2025-05-07T20:32:12.1252995Z 2025-05-07T20:32:12.1253277Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.1253622Z 2025-05-07T20:32:12.1253821Z x_sign = torch.sign(x) 2025-05-07T20:32:12.1254115Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.1254499Z x = x_sign * x_clamp 2025-05-07T20:32:12.1254743Z x0 = x[:, :D] 2025-05-07T20:32:12.1254958Z x1 = x[:, D:] 2025-05-07T20:32:12.1255169Z 2025-05-07T20:32:12.1255362Z if contiguous: 2025-05-07T20:32:12.1255778Z x0 = x0.contiguous() 2025-05-07T20:32:12.1256049Z x1 = x1.contiguous() 2025-05-07T20:32:12.1256293Z 2025-05-07T20:32:12.1256488Z if scale_ub is not None: 2025-05-07T20:32:12.1256767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.1257112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.1257422Z ) 2025-05-07T20:32:12.1257621Z else: 2025-05-07T20:32:12.1257839Z scale_ub_tensor = None 2025-05-07T20:32:12.1258174Z 2025-05-07T20:32:12.1258418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.1258739Z op = silu_mul_quant 2025-05-07T20:32:12.1258992Z if compiled: 2025-05-07T20:32:12.1259248Z op = torch.compile(op) 2025-05-07T20:32:12.1259551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1259825Z 2025-05-07T20:32:12.1260018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.1260192Z 2025-05-07T20:32:12.1260295Z moe/activation_test.py:117: 2025-05-07T20:32:12.1260600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1260925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.1261219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.1261791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.1262353Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.1263143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.1263849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.1264396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.1265081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.1265750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1266298Z kernel = self.compile( 2025-05-07T20:32:12.1266851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1267519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1267925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1268150Z 2025-05-07T20:32:12.1268374Z self = 2025-05-07T20:32:12.1269470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1270872Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a0310>} 2025-05-07T20:32:12.1272327Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1273400Z context = 2025-05-07T20:32:12.1273691Z 2025-05-07T20:32:12.1273870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1274402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1274880Z module_map=module_map) 2025-05-07T20:32:12.1275315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1275675Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.1275936Z E ^ 2025-05-07T20:32:12.1276408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1276866Z 2025-05-07T20:32:12.1277295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.1277814Z 2025-05-07T20:32:12.1277935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.1278353Z self=, 2025-05-07T20:32:12.1278762Z T=128, 2025-05-07T20:32:12.1278958Z D=5120, 2025-05-07T20:32:12.1279155Z scale_ub=1200.0, 2025-05-07T20:32:12.1279382Z contiguous=False, 2025-05-07T20:32:12.1279612Z compiled=True, 2025-05-07T20:32:12.1279814Z ) 2025-05-07T20:32:12.4257580Z self = 2025-05-07T20:32:12.4258889Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.4259745Z 2025-05-07T20:32:12.4259973Z @given( 2025-05-07T20:32:12.4260589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4261282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4261890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4262227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4262567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4262853Z ) 2025-05-07T20:32:12.4263212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4263823Z def test_silu_mul_quant( 2025-05-07T20:32:12.4264070Z self, 2025-05-07T20:32:12.4264269Z T: int, 2025-05-07T20:32:12.4264475Z D: int, 2025-05-07T20:32:12.4264698Z scale_ub: Optional[float], 2025-05-07T20:32:12.4264976Z contiguous: bool, 2025-05-07T20:32:12.4265218Z compiled: bool, 2025-05-07T20:32:12.4265445Z ) -> None: 2025-05-07T20:32:12.4265664Z torch.manual_seed(2025) 2025-05-07T20:32:12.4265909Z 2025-05-07T20:32:12.4266188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4266538Z 2025-05-07T20:32:12.4266739Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4267034Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4267348Z x = x_sign * x_clamp 2025-05-07T20:32:12.4267588Z x0 = x[:, :D] 2025-05-07T20:32:12.4267811Z x1 = x[:, D:] 2025-05-07T20:32:12.4268019Z 2025-05-07T20:32:12.4268214Z if contiguous: 2025-05-07T20:32:12.4268455Z x0 = x0.contiguous() 2025-05-07T20:32:12.4268717Z x1 = x1.contiguous() 2025-05-07T20:32:12.4268964Z 2025-05-07T20:32:12.4269168Z if scale_ub is not None: 2025-05-07T20:32:12.4269446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4269789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4270102Z ) 2025-05-07T20:32:12.4270294Z else: 2025-05-07T20:32:12.4270508Z scale_ub_tensor = None 2025-05-07T20:32:12.4270833Z 2025-05-07T20:32:12.4271067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4271387Z op = silu_mul_quant 2025-05-07T20:32:12.4271641Z if compiled: 2025-05-07T20:32:12.4271887Z op = torch.compile(op) 2025-05-07T20:32:12.4272194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4272481Z 2025-05-07T20:32:12.4272702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4283942Z 2025-05-07T20:32:12.4284080Z moe/activation_test.py:117: 2025-05-07T20:32:12.4284400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4284895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4285188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4285762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4286330Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4286995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4287696Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4288241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4288923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4289599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4290130Z kernel = self.compile( 2025-05-07T20:32:12.4290685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4291345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4291756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4292024Z 2025-05-07T20:32:12.4292248Z self = 2025-05-07T20:32:12.4293349Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4294839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a1090>} 2025-05-07T20:32:12.4296212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4297259Z context = 2025-05-07T20:32:12.4297551Z 2025-05-07T20:32:12.4297728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4298317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4298797Z module_map=module_map) 2025-05-07T20:32:12.4299169Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4299531Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4299791Z E ^ 2025-05-07T20:32:12.4300268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4300724Z 2025-05-07T20:32:12.4301153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4301676Z 2025-05-07T20:32:12.4301790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.4302253Z self=, 2025-05-07T20:32:12.4302656Z T=16384, 2025-05-07T20:32:12.4302908Z D=7168, 2025-05-07T20:32:12.4303099Z scale_ub=1200.0, 2025-05-07T20:32:12.4303323Z contiguous=True, 2025-05-07T20:32:12.4303548Z compiled=True, 2025-05-07T20:32:12.4303749Z ) 2025-05-07T20:32:12.4304075Z self = 2025-05-07T20:32:12.4304578Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.4304855Z 2025-05-07T20:32:12.4304933Z @given( 2025-05-07T20:32:12.4305172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.4305489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.4305799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.4306173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.4306513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.4306804Z ) 2025-05-07T20:32:12.4307153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.4307603Z def test_silu_mul_quant( 2025-05-07T20:32:12.4307848Z self, 2025-05-07T20:32:12.4308038Z T: int, 2025-05-07T20:32:12.4308233Z D: int, 2025-05-07T20:32:12.4308451Z scale_ub: Optional[float], 2025-05-07T20:32:12.4308717Z contiguous: bool, 2025-05-07T20:32:12.4308961Z compiled: bool, 2025-05-07T20:32:12.4309189Z ) -> None: 2025-05-07T20:32:12.4309399Z torch.manual_seed(2025) 2025-05-07T20:32:12.4309640Z 2025-05-07T20:32:12.4309920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.4310261Z 2025-05-07T20:32:12.4310452Z x_sign = torch.sign(x) 2025-05-07T20:32:12.4310757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.4311060Z x = x_sign * x_clamp 2025-05-07T20:32:12.4311302Z x0 = x[:, :D] 2025-05-07T20:32:12.4311517Z x1 = x[:, D:] 2025-05-07T20:32:12.4311717Z 2025-05-07T20:32:12.4311909Z if contiguous: 2025-05-07T20:32:12.4312150Z x0 = x0.contiguous() 2025-05-07T20:32:12.4312447Z x1 = x1.contiguous() 2025-05-07T20:32:12.4312686Z 2025-05-07T20:32:12.4312884Z if scale_ub is not None: 2025-05-07T20:32:12.4313154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.4313495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.4313801Z ) 2025-05-07T20:32:12.4313992Z else: 2025-05-07T20:32:12.4314286Z scale_ub_tensor = None 2025-05-07T20:32:12.4314540Z 2025-05-07T20:32:12.4314767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.4315082Z op = silu_mul_quant 2025-05-07T20:32:12.4315333Z if compiled: 2025-05-07T20:32:12.4315582Z op = torch.compile(op) 2025-05-07T20:32:12.4315880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4316158Z 2025-05-07T20:32:12.4316358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.4316523Z 2025-05-07T20:32:12.4316621Z moe/activation_test.py:117: 2025-05-07T20:32:12.4316920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4317248Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.4317529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.4318093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.4318669Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.4319334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.4320026Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.4320567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.4321259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.4321974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.4322561Z kernel = self.compile( 2025-05-07T20:32:12.4323107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.4323768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.4324165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.4324396Z 2025-05-07T20:32:12.4324605Z self = 2025-05-07T20:32:12.4325749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.4327142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2290>} 2025-05-07T20:32:12.4328502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.4329538Z context = 2025-05-07T20:32:12.4329833Z 2025-05-07T20:32:12.4330003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.4330531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.4331004Z module_map=module_map) 2025-05-07T20:32:12.4331415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.4331813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.4332097Z E ^ 2025-05-07T20:32:12.4332638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.4333198Z 2025-05-07T20:32:12.4333702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.4334330Z 2025-05-07T20:32:12.5686022Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.5686477Z self=, 2025-05-07T20:32:12.5687063Z T=16384, 2025-05-07T20:32:12.5687278Z D=5120, 2025-05-07T20:32:12.5687483Z scale_ub=1200.0, 2025-05-07T20:32:12.5687710Z contiguous=True, 2025-05-07T20:32:12.5687945Z compiled=False, 2025-05-07T20:32:12.5688158Z ) 2025-05-07T20:32:12.5688493Z self = 2025-05-07T20:32:12.5689119Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.5689466Z 2025-05-07T20:32:12.5689547Z @given( 2025-05-07T20:32:12.5689789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.5690103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.5690418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.5690758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.5691086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.5691384Z ) 2025-05-07T20:32:12.5691747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.5692192Z def test_silu_mul_quant( 2025-05-07T20:32:12.5692434Z self, 2025-05-07T20:32:12.5692638Z T: int, 2025-05-07T20:32:12.5692838Z D: int, 2025-05-07T20:32:12.5693061Z scale_ub: Optional[float], 2025-05-07T20:32:12.5693343Z contiguous: bool, 2025-05-07T20:32:12.5693587Z compiled: bool, 2025-05-07T20:32:12.5693852Z ) -> None: 2025-05-07T20:32:12.5694071Z torch.manual_seed(2025) 2025-05-07T20:32:12.5694436Z 2025-05-07T20:32:12.5694716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.5695062Z 2025-05-07T20:32:12.5695253Z x_sign = torch.sign(x) 2025-05-07T20:32:12.5695553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.5695865Z x = x_sign * x_clamp 2025-05-07T20:32:12.5696103Z x0 = x[:, :D] 2025-05-07T20:32:12.5696324Z x1 = x[:, D:] 2025-05-07T20:32:12.5696533Z 2025-05-07T20:32:12.5696724Z if contiguous: 2025-05-07T20:32:12.5696963Z x0 = x0.contiguous() 2025-05-07T20:32:12.5697229Z x1 = x1.contiguous() 2025-05-07T20:32:12.5697540Z 2025-05-07T20:32:12.5697739Z if scale_ub is not None: 2025-05-07T20:32:12.5698110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.5698446Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.5698757Z ) 2025-05-07T20:32:12.5698952Z else: 2025-05-07T20:32:12.5699165Z scale_ub_tensor = None 2025-05-07T20:32:12.5699421Z 2025-05-07T20:32:12.5699660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.5699977Z op = silu_mul_quant 2025-05-07T20:32:12.5700231Z if compiled: 2025-05-07T20:32:12.5700484Z op = torch.compile(op) 2025-05-07T20:32:12.5700787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5701060Z 2025-05-07T20:32:12.5701263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.5701430Z 2025-05-07T20:32:12.5701537Z moe/activation_test.py:117: 2025-05-07T20:32:12.5701834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5702171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.5702459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5703156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:12.5703860Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.5704405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.5705106Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.5705775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.5706405Z kernel = self.compile( 2025-05-07T20:32:12.5706959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.5707628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.5708023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5708255Z 2025-05-07T20:32:12.5708469Z self = 2025-05-07T20:32:12.5709565Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.5710969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a11b0>} 2025-05-07T20:32:12.5712337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.5713381Z context = 2025-05-07T20:32:12.5713679Z 2025-05-07T20:32:12.5713848Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.5714385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.5714908Z module_map=module_map) 2025-05-07T20:32:12.5715278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.5715645Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.5715912Z E ^ 2025-05-07T20:32:12.5716382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.5716844Z 2025-05-07T20:32:12.5717271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.5717796Z 2025-05-07T20:32:12.5717948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.5718369Z self=, 2025-05-07T20:32:12.5718777Z T=1, 2025-05-07T20:32:12.5718968Z D=7168, 2025-05-07T20:32:12.5719169Z scale_ub=1200.0, 2025-05-07T20:32:12.5719397Z contiguous=False, 2025-05-07T20:32:12.5719636Z compiled=False, 2025-05-07T20:32:12.5719844Z ) 2025-05-07T20:32:12.5720165Z self = 2025-05-07T20:32:12.5720671Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.5720952Z 2025-05-07T20:32:12.5721031Z @given( 2025-05-07T20:32:12.5721269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.5721582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.5721900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.5722238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.5722568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.5722858Z ) 2025-05-07T20:32:12.5723211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.5723652Z def test_silu_mul_quant( 2025-05-07T20:32:12.5723897Z self, 2025-05-07T20:32:12.5724097Z T: int, 2025-05-07T20:32:12.5724294Z D: int, 2025-05-07T20:32:12.5724517Z scale_ub: Optional[float], 2025-05-07T20:32:12.5724798Z contiguous: bool, 2025-05-07T20:32:12.5725043Z compiled: bool, 2025-05-07T20:32:12.5725264Z ) -> None: 2025-05-07T20:32:12.5725492Z torch.manual_seed(2025) 2025-05-07T20:32:12.5725742Z 2025-05-07T20:32:12.5726016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.5726364Z 2025-05-07T20:32:12.5726649Z x_sign = torch.sign(x) 2025-05-07T20:32:12.5726947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.5727264Z x = x_sign * x_clamp 2025-05-07T20:32:12.5727511Z x0 = x[:, :D] 2025-05-07T20:32:12.5727727Z x1 = x[:, D:] 2025-05-07T20:32:12.5727937Z 2025-05-07T20:32:12.5728134Z if contiguous: 2025-05-07T20:32:12.5728364Z x0 = x0.contiguous() 2025-05-07T20:32:12.5728627Z x1 = x1.contiguous() 2025-05-07T20:32:12.5728873Z 2025-05-07T20:32:12.5729063Z if scale_ub is not None: 2025-05-07T20:32:12.5729347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.5729693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.5730004Z ) 2025-05-07T20:32:12.5730204Z else: 2025-05-07T20:32:12.5730419Z scale_ub_tensor = None 2025-05-07T20:32:12.5730678Z 2025-05-07T20:32:12.5730918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.5731244Z op = silu_mul_quant 2025-05-07T20:32:12.5731503Z if compiled: 2025-05-07T20:32:12.5731759Z op = torch.compile(op) 2025-05-07T20:32:12.5732062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5732344Z 2025-05-07T20:32:12.5732538Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.5732711Z 2025-05-07T20:32:12.5732812Z moe/activation_test.py:117: 2025-05-07T20:32:12.5733114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5733489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.5733780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.5734480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.5735181Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.5735727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.5736419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.5737138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.5737675Z kernel = self.compile( 2025-05-07T20:32:12.5738283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.5738954Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.5739359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.5739584Z 2025-05-07T20:32:12.5739796Z self = 2025-05-07T20:32:12.5740893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.5742291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2680>} 2025-05-07T20:32:12.5743662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.5744704Z context = 2025-05-07T20:32:12.5744997Z 2025-05-07T20:32:12.5745169Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.5745700Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.5746178Z module_map=module_map) 2025-05-07T20:32:12.5746627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.5746989Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.5747253Z E ^ 2025-05-07T20:32:12.5747730Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.5748185Z 2025-05-07T20:32:12.5748607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.5749132Z 2025-05-07T20:32:12.7665451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.7666139Z self=, 2025-05-07T20:32:12.7666737Z T=4096, 2025-05-07T20:32:12.7667000Z D=7168, 2025-05-07T20:32:12.7667272Z scale_ub=1200.0, 2025-05-07T20:32:12.7667601Z contiguous=False, 2025-05-07T20:32:12.7667833Z compiled=True, 2025-05-07T20:32:12.7668040Z ) 2025-05-07T20:32:12.7668375Z self = 2025-05-07T20:32:12.7668880Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7669160Z 2025-05-07T20:32:12.7669239Z @given( 2025-05-07T20:32:12.7669472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7669789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7670093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7670433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7670909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7671191Z ) 2025-05-07T20:32:12.7671548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7671995Z def test_silu_mul_quant( 2025-05-07T20:32:12.7672231Z self, 2025-05-07T20:32:12.7672430Z T: int, 2025-05-07T20:32:12.7672642Z D: int, 2025-05-07T20:32:12.7672891Z scale_ub: Optional[float], 2025-05-07T20:32:12.7673170Z contiguous: bool, 2025-05-07T20:32:12.7673434Z compiled: bool, 2025-05-07T20:32:12.7673663Z ) -> None: 2025-05-07T20:32:12.7673877Z torch.manual_seed(2025) 2025-05-07T20:32:12.7674247Z 2025-05-07T20:32:12.7674523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7674864Z 2025-05-07T20:32:12.7675058Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7675354Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7675662Z x = x_sign * x_clamp 2025-05-07T20:32:12.7675900Z x0 = x[:, :D] 2025-05-07T20:32:12.7676114Z x1 = x[:, D:] 2025-05-07T20:32:12.7676321Z 2025-05-07T20:32:12.7676502Z if contiguous: 2025-05-07T20:32:12.7676732Z x0 = x0.contiguous() 2025-05-07T20:32:12.7676992Z x1 = x1.contiguous() 2025-05-07T20:32:12.7677227Z 2025-05-07T20:32:12.7677420Z if scale_ub is not None: 2025-05-07T20:32:12.7677699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7678034Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7678344Z ) 2025-05-07T20:32:12.7678542Z else: 2025-05-07T20:32:12.7678747Z scale_ub_tensor = None 2025-05-07T20:32:12.7678998Z 2025-05-07T20:32:12.7679236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7679545Z op = silu_mul_quant 2025-05-07T20:32:12.7679793Z if compiled: 2025-05-07T20:32:12.7680039Z op = torch.compile(op) 2025-05-07T20:32:12.7680342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7680611Z 2025-05-07T20:32:12.7680804Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7680972Z 2025-05-07T20:32:12.7681078Z moe/activation_test.py:117: 2025-05-07T20:32:12.7681368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7681700Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7682107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7682677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7683252Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7683921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.7684623Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.7685161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.7685855Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.7686524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.7687059Z     kernel = self.compile(
2025-05-07T20:32:12.7687613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.7688284Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.7688679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.7688906Z
2025-05-07T20:32:12.7689119Z self = <...>
2025-05-07T20:32:12.7690218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:12.7692139Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaeb2a3b50>}
2025-05-07T20:32:12.7693737Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:12.7694787Z context = <...>
2025-05-07T20:32:12.7695136Z
2025-05-07T20:32:12.7695306Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:12.7695834Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:12.7696310Z                            module_map=module_map)
2025-05-07T20:32:12.7696675Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.7697033Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.7697291Z E       ^
2025-05-07T20:32:12.7697763Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.7698305Z
2025-05-07T20:32:12.7698726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.7699257Z
2025-05-07T20:32:12.7699362Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.7699776Z     self=<...>,
2025-05-07T20:32:12.7700179Z     T=128,
2025-05-07T20:32:12.7700360Z     D=7168,
2025-05-07T20:32:12.7700553Z     scale_ub=1200.0,
2025-05-07T20:32:12.7700784Z     contiguous=False,
2025-05-07T20:32:12.7701002Z     compiled=True,
2025-05-07T20:32:12.7701205Z )
2025-05-07T20:32:12.8725332Z self = <...>
2025-05-07T20:32:12.8726150Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:12.8726551Z
2025-05-07T20:32:12.8733551Z     @given(
2025-05-07T20:32:12.8733934Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:12.8734370Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:12.8734681Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:12.8735207Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:12.8735547Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:12.8735829Z     )
2025-05-07T20:32:12.8736191Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:12.8736641Z     def test_silu_mul_quant(
2025-05-07T20:32:12.8736882Z         self,
2025-05-07T20:32:12.8737078Z         T: int,
2025-05-07T20:32:12.8737278Z         D: int,
2025-05-07T20:32:12.8737491Z         scale_ub: Optional[float],
2025-05-07T20:32:12.8737776Z         contiguous: bool,
2025-05-07T20:32:12.8738100Z         compiled: bool,
2025-05-07T20:32:12.8738332Z     ) -> None:
2025-05-07T20:32:12.8738554Z         torch.manual_seed(2025)
2025-05-07T20:32:12.8738801Z
2025-05-07T20:32:12.8739081Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:12.8739424Z
2025-05-07T20:32:12.8739620Z         x_sign = torch.sign(x)
2025-05-07T20:32:12.8739924Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:12.8740237Z         x = x_sign * x_clamp
2025-05-07T20:32:12.8740479Z         x0 = x[:, :D]
2025-05-07T20:32:12.8740700Z         x1 = x[:, D:]
2025-05-07T20:32:12.8740907Z
2025-05-07T20:32:12.8741098Z         if contiguous:
2025-05-07T20:32:12.8741335Z             x0 = x0.contiguous()
2025-05-07T20:32:12.8741597Z             x1 = x1.contiguous()
2025-05-07T20:32:12.8741841Z
2025-05-07T20:32:12.8742041Z         if scale_ub is not None:
2025-05-07T20:32:12.8742316Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:12.8742772Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:12.8743103Z             )
2025-05-07T20:32:12.8743294Z         else:
2025-05-07T20:32:12.8743508Z             scale_ub_tensor = None
2025-05-07T20:32:12.8743766Z
2025-05-07T20:32:12.8744008Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:12.8744324Z             op = silu_mul_quant
2025-05-07T20:32:12.8744585Z             if compiled:
2025-05-07T20:32:12.8744841Z                 op = torch.compile(op)
2025-05-07T20:32:12.8745142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.8745491Z
2025-05-07T20:32:12.8745688Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:12.8745858Z
2025-05-07T20:32:12.8745964Z moe/activation_test.py:117:
2025-05-07T20:32:12.8746269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.8746603Z moe/activation_test.py:115: in fn
2025-05-07T20:32:12.8746889Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:12.8747465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:12.8748046Z     return fn(*args, **kwargs)
2025-05-07T20:32:12.8748722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:12.8749423Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:12.8749980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:12.8750682Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:12.8751365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:12.8751907Z     kernel = self.compile(
2025-05-07T20:32:12.8752465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:12.8753189Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:12.8753591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:12.8753819Z
2025-05-07T20:32:12.8754037Z self = <...>
2025-05-07T20:32:12.8755228Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:12.8756974Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7faaeb318670>}
2025-05-07T20:32:12.8758359Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:12.8759419Z context = <...>
2025-05-07T20:32:12.8759714Z
2025-05-07T20:32:12.8759896Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:12.8760424Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:12.8760911Z                            module_map=module_map)
2025-05-07T20:32:12.8761288Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8761651Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8761909Z E       ^
2025-05-07T20:32:12.8762386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8762847Z
2025-05-07T20:32:12.8763276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8763871Z
2025-05-07T20:32:12.8763982Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported; source listing and traceback identical to the example above)
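Every compiled example above fails in the same place, before the kernel ever launches: the Triton kernel requests the fp8e4nv (e4m3) element type, which Triton only lowers on newer NVIDIA architectures (compute capability 8.9 or higher, i.e. Ada/Hopper class, in recent releases). The GPU driving this job predates that, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. Below is a minimal sketch of a capability gate that would skip such cases up front; supports_fp8e4nv, the threshold, and the test class are illustrative assumptions, not FBGEMM's actual guard:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (e4m3) only on compute capability
    # >= (8, 9); the GPU in this job reports a capability below that.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv is unsupported on this GPU")
class Fp8GatedTest(unittest.TestCase):
    def test_capability_gate(self) -> None:
        # Only runs on hardware where the Triton fp8e4nv path can compile.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))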
2025-05-07T20:32:12.8781989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.8782693Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.8783356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.8784043Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.8784718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.8785256Z kernel = self.compile( 2025-05-07T20:32:12.8785805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.8786473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.8786918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.8787144Z 2025-05-07T20:32:12.8787358Z self = 2025-05-07T20:32:12.8788453Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.8789854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb3191b0>} 2025-05-07T20:32:12.8791227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.8792273Z context = 2025-05-07T20:32:12.8792569Z 2025-05-07T20:32:12.8792740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.8793262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.8793735Z module_map=module_map) 2025-05-07T20:32:12.8794104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.8794459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.8794721Z E ^ 2025-05-07T20:32:12.8795191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.8795648Z 2025-05-07T20:32:12.8796159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.8796680Z 2025-05-07T20:32:12.9599131Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.9599799Z self=, 2025-05-07T20:32:12.9600370Z T=16384, 2025-05-07T20:32:12.9600652Z D=5120, 2025-05-07T20:32:12.9600845Z scale_ub=None, 2025-05-07T20:32:12.9601068Z contiguous=False, 2025-05-07T20:32:12.9601300Z compiled=False, 2025-05-07T20:32:12.9601514Z ) 2025-05-07T20:32:12.9601836Z self = 2025-05-07T20:32:12.9602347Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.9602631Z 2025-05-07T20:32:12.9602713Z @given( 2025-05-07T20:32:12.9602945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.9603270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.9603588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.9603919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.9604257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.9604553Z ) 2025-05-07T20:32:12.9604915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.9605363Z def test_silu_mul_quant( 2025-05-07T20:32:12.9605607Z self, 2025-05-07T20:32:12.9605809Z T: int, 2025-05-07T20:32:12.9606004Z D: int, 2025-05-07T20:32:12.9606338Z scale_ub: Optional[float], 2025-05-07T20:32:12.9606616Z contiguous: bool, 2025-05-07T20:32:12.9606855Z compiled: bool, 2025-05-07T20:32:12.9607083Z ) -> None: 2025-05-07T20:32:12.9607306Z torch.manual_seed(2025) 2025-05-07T20:32:12.9607551Z 2025-05-07T20:32:12.9607828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.9608172Z 2025-05-07T20:32:12.9608364Z x_sign = torch.sign(x) 2025-05-07T20:32:12.9608673Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.9610770Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:12.9612797Z 2025-05-07T20:32:12.9612917Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:12.9613140Z 2025-05-07T20:32:12.9613247Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.9613672Z self=, 2025-05-07T20:32:12.9614082Z T=4096, 2025-05-07T20:32:12.9614271Z D=7168, 2025-05-07T20:32:12.9614465Z scale_ub=1200.0, 2025-05-07T20:32:12.9614687Z contiguous=True, 2025-05-07T20:32:12.9614914Z compiled=True, 2025-05-07T20:32:12.9615120Z ) 2025-05-07T20:32:12.9615445Z self = 2025-05-07T20:32:12.9615946Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:12.9616251Z 2025-05-07T20:32:12.9616370Z @given( 2025-05-07T20:32:12.9616662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.9616981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.9617297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.9617635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.9617967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.9618343Z ) 2025-05-07T20:32:12.9618835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.9619284Z def test_silu_mul_quant( 2025-05-07T20:32:12.9619526Z self, 2025-05-07T20:32:12.9619722Z T: int, 2025-05-07T20:32:12.9619921Z D: int, 2025-05-07T20:32:12.9620143Z scale_ub: Optional[float], 2025-05-07T20:32:12.9620418Z contiguous: bool, 2025-05-07T20:32:12.9620659Z compiled: bool, 2025-05-07T20:32:12.9620887Z ) -> None: 2025-05-07T20:32:12.9621109Z torch.manual_seed(2025) 2025-05-07T20:32:12.9621355Z 2025-05-07T20:32:12.9621634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.9621980Z 2025-05-07T20:32:12.9622176Z x_sign = torch.sign(x) 2025-05-07T20:32:12.9622478Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.9624550Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
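This is the second, independent failure mode, and the numbers in the message match the test's own allocations: the [T, 2 * D] bfloat16 input for T=16384, D=5120 is 16384 x 10240 x 2 bytes = 320 MiB, exactly the failed request, while the reported free memory shrinks from 140.44 MiB here to under 30 MiB in later examples, suggesting tensors cached from earlier Hypothesis examples are never returned. Two mitigations, sketched as assumptions rather than as changes the suite actually makes; free_cuda_memory is an illustrative helper, not part of activation_test.py:

import gc
import os

# The allocator hint quoted in the error text must be in the environment
# before CUDA initializes (e.g. exported by the workflow), not set mid-run:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the env var so the setting can take effect


def free_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver,
    # so each Hypothesis example starts from a clean allocator pool.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()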
The remaining examples repeat the same two failure modes; their source listings and tracebacks, identical to the ones shown above, are condensed here:
2025-05-07T20:32:12.9613247Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:32:12.9626979Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:12.9639703Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:32:12.9653260Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:32:13.0939879Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
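To chase a single failing parameter set without replaying the whole Hypothesis search, the arguments recorded in this log can be pinned with Hypothesis's @example decorator, which always runs before any randomly drawn examples. A self-contained sketch; the test name and trivial body stand in for test_silu_mul_quant's real assertions:

from typing import Optional

from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@example(T=16384, D=5120, scale_ub=None)  # the first OOM case in this log
@settings(max_examples=5, deadline=None)
def test_pinned_example(T: int, D: int, scale_ub: Optional[float]) -> None:
    # Stand-in body; the real test would build its [T, 2 * D] input here.
    assert T >= 1 and D in (5120, 7168)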
2025-05-07T20:32:13.0976703Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.1779059Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.1810783Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
2025-05-07T20:32:13.2781998Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at compiler.py:100 (fp8e4nv not supported)
2025-05-07T20:32:13.2816604Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.2829733Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 320.00 MiB)
2025-05-07T20:32:13.3816751Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 80.00 MiB)
2025-05-07T20:32:13.3829980Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.3842441Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:13.3855043Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:13.3867901Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:13.5155457Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:13.5156066Z     self=<...>,
2025-05-07T20:32:13.5156480Z     T=16384,
2025-05-07T20:32:13.5156672Z     D=7168,
2025-05-07T20:32:13.5156868Z     scale_ub=None,
2025-05-07T20:32:13.5157092Z     contiguous=False,
2025-05-07T20:32:13.5157322Z     compiled=True,
2025-05-07T20:32:13.5157529Z )
2025-05-07T20:32:13.5157853Z self = <...>
2025-05-07T20:32:13.5158360Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:13.5158648Z
2025-05-07T20:32:13.5163334Z         torch.manual_seed(2025)
2025-05-07T20:32:13.5163579Z
2025-05-07T20:32:13.5163853Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:13.5165976Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5167968Z 2025-05-07T20:32:13.5168098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5168313Z 2025-05-07T20:32:13.5168419Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5168844Z self=, 2025-05-07T20:32:13.5169256Z T=4096, 2025-05-07T20:32:13.5169440Z D=7168, 2025-05-07T20:32:13.5169637Z scale_ub=None, 2025-05-07T20:32:13.5169862Z contiguous=True, 2025-05-07T20:32:13.5170086Z compiled=False, 2025-05-07T20:32:13.5170294Z ) 2025-05-07T20:32:13.5170622Z self = 2025-05-07T20:32:13.5171187Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.5171466Z 2025-05-07T20:32:13.5171543Z @given( 2025-05-07T20:32:13.5171776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5172093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5172406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5172741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5173106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5173414Z ) 2025-05-07T20:32:13.5173767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5174216Z def test_silu_mul_quant( 2025-05-07T20:32:13.5174455Z self, 2025-05-07T20:32:13.5174654Z T: int, 2025-05-07T20:32:13.5174854Z D: int, 2025-05-07T20:32:13.5175070Z scale_ub: Optional[float], 2025-05-07T20:32:13.5175347Z contiguous: bool, 2025-05-07T20:32:13.5175589Z compiled: bool, 2025-05-07T20:32:13.5175813Z ) -> None: 2025-05-07T20:32:13.5176034Z torch.manual_seed(2025) 2025-05-07T20:32:13.5176277Z 2025-05-07T20:32:13.5176555Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5178798Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5180727Z 2025-05-07T20:32:13.5180847Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5181067Z 2025-05-07T20:32:13.5181170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5181587Z self=, 2025-05-07T20:32:13.5181998Z T=16384, 2025-05-07T20:32:13.5182191Z D=7168, 2025-05-07T20:32:13.5182386Z scale_ub=None, 2025-05-07T20:32:13.5182609Z contiguous=True, 2025-05-07T20:32:13.5182831Z compiled=False, 2025-05-07T20:32:13.5183035Z ) 2025-05-07T20:32:13.5183356Z self = 2025-05-07T20:32:13.5183858Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:13.5184140Z 2025-05-07T20:32:13.5184217Z @given( 2025-05-07T20:32:13.5184447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5184762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5185071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5185408Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5185744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5186027Z ) 2025-05-07T20:32:13.5186381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5186835Z def test_silu_mul_quant( 2025-05-07T20:32:13.5187125Z self, 2025-05-07T20:32:13.5187327Z T: int, 2025-05-07T20:32:13.5187525Z D: int, 2025-05-07T20:32:13.5187741Z scale_ub: Optional[float], 2025-05-07T20:32:13.5188019Z contiguous: bool, 2025-05-07T20:32:13.5188263Z compiled: bool, 2025-05-07T20:32:13.5188483Z ) -> None: 2025-05-07T20:32:13.5188704Z torch.manual_seed(2025) 2025-05-07T20:32:13.5188949Z 2025-05-07T20:32:13.5189229Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5191338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:13.5193306Z 2025-05-07T20:32:13.5193426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:13.5193646Z 2025-05-07T20:32:13.5193752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.5194173Z self=, 2025-05-07T20:32:13.5194579Z T=16384, 2025-05-07T20:32:13.5194776Z D=7168, 2025-05-07T20:32:13.5194968Z scale_ub=1200.0, 2025-05-07T20:32:13.5195188Z contiguous=True, 2025-05-07T20:32:13.5195415Z compiled=False, 2025-05-07T20:32:13.5195620Z ) 2025-05-07T20:32:13.5195941Z self = 2025-05-07T20:32:13.5196443Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:13.5196725Z 2025-05-07T20:32:13.5196802Z @given( 2025-05-07T20:32:13.5197036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.5197351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.5197661Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.5197996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.5198324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.5198614Z ) 2025-05-07T20:32:13.5198971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.5199494Z def test_silu_mul_quant( 2025-05-07T20:32:13.5199742Z self, 2025-05-07T20:32:13.5199941Z T: int, 2025-05-07T20:32:13.5200135Z D: int, 2025-05-07T20:32:13.5200359Z scale_ub: Optional[float], 2025-05-07T20:32:13.5200632Z contiguous: bool, 2025-05-07T20:32:13.5200877Z compiled: bool, 2025-05-07T20:32:13.5201098Z ) -> None: 2025-05-07T20:32:13.5201319Z torch.manual_seed(2025) 2025-05-07T20:32:13.5201566Z 2025-05-07T20:32:13.5201837Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.5204005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
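The allocator report points at two distinct problems: the 22.07 GiB device fills up as Hypothesis runs successive large examples, and the reserved-but-unallocated figure hints at fragmentation. Below is a minimal mitigation sketch, not part of this test suite: the helper name release_cuda_memory is ours, and the environment variable only takes effect if it is set before the process first initializes CUDA.

import gc
import os

# Must be set before the first CUDA allocation in the process, so keep it at
# the very top of the test module (or export it in the job environment).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  -- imported after the allocator knob is set


def release_cuda_memory() -> None:
    # Hypothesis replays all examples inside a single test invocation, so
    # unittest's tearDown never runs between examples; call this at the end
    # of the test body instead. gc.collect() frees unreachable tensors and
    # empty_cache() hands the allocator's cached blocks back to the driver.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

Note that cached blocks still referenced by live tensors cannot be returned, so dropping references (del x, x0, x1) before calling the helper matters more than the empty_cache() call itself.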
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next examples failed in the same two ways:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> the same CompilationError, reached through torch/_dynamo/eval_frame.py:678 since compiled=True routes silu_mul_quant through torch.compile
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:95, x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) (tried to allocate 20.00 MiB; free memory now down to 4.44 MiB)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 20.00 MiB)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
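Unlike the intermittent OOMs, the CompilationError is deterministic on this hardware: Triton refuses to lower its fp8e4nv type (PyTorch's float8_e4m3fn) on this GPU, which per the error only supports the fp8e4b15 and fp8e5 encodings. A common guard is to skip fp8 paths below a minimum compute capability; the sketch below is ours, not FBGEMM API, and the (8, 9) cutoff is an assumption that e4m3 codegen begins with Ada-class parts, which matches the supported-dtype list in the error but is not stated by this log.

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Assumed threshold: fp8e4nv needs compute capability >= (8, 9); the
    # runner GPU here reports only fp8e4b15/fp8e5, i.e. an older part.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement: decorate the fp8-dependent test class or method.
@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8GatedTests(unittest.TestCase):
    pass

With such a guard the job would report the fp8 tests as skipped on this runner instead of failing the whole Hypothesis run.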
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |     method()
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  | Falsifying example: test_silu_mul_quant(
  |     self=,
  |     T=2048,
  |     D=5120,  # or any other generated value
  |     scale_ub=None,  # or any other generated value
  |     contiguous=True,  # or any other generated value
  |     compiled=False,  # or any other generated value
  | )
  |
  | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  | Falsifying example: test_silu_mul_quant(
  |     self=,
  |     T=128,
  |     D=7168,
  |     scale_ub=None,
  |     contiguous=True,
  |     compiled=True,
  | )
  |
  | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
  | Traceback (most recent call last):
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
  |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.1442964Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.1443415Z | self=, 2025-05-07T20:32:14.1443834Z | T=128, 2025-05-07T20:32:14.1444038Z | D=5120, 2025-05-07T20:32:14.1444264Z | scale_ub=1200.0, 2025-05-07T20:32:14.1444513Z | contiguous=True, 2025-05-07T20:32:14.1444756Z | compiled=True, 2025-05-07T20:32:14.1444993Z | ) 2025-05-07T20:32:14.1445179Z | 2025-05-07T20:32:14.1445707Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.1446327Z +---------------- 4 ---------------- 2025-05-07T20:32:14.1446629Z | Traceback (most recent call last): 2025-05-07T20:32:14.1447364Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:14.1448086Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1448988Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:14.1450091Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1451258Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:14.1452411Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1453304Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:14.1454380Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1455489Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:14.1456893Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1458155Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:14.1459329Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1460483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:14.1461467Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1462475Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:14.1463449Z | fn() 2025-05-07T20:32:14.1464281Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:14.1465205Z | self.fn.run( 2025-05-07T20:32:14.1465981Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:14.1466836Z | kernel = self.compile( 2025-05-07T20:32:14.1467720Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:14.1468841Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1469871Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.1471020Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1471743Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1472253Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1472673Z | ^ 2025-05-07T20:32:14.1473355Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1474181Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:14.1474769Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:14.1475546Z | self=, 2025-05-07T20:32:14.1476181Z | T=1, # or any other generated value 2025-05-07T20:32:14.1476638Z | D=5120, # or any other generated value 2025-05-07T20:32:14.1477122Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:14.1477640Z | contiguous=True, # or any other generated value 2025-05-07T20:32:14.1478178Z | compiled=True, # or any other generated value 2025-05-07T20:32:14.1478617Z | ) 2025-05-07T20:32:14.1478881Z | 2025-05-07T20:32:14.1479657Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:14.1480526Z +------------------------------------ 2025-05-07T20:32:14.1481202Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:14.1500843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1501429Z self=, 2025-05-07T20:32:14.1501988Z T=1, 2025-05-07T20:32:14.1502245Z D=5120, 2025-05-07T20:32:14.1502509Z scale_ub=None, 2025-05-07T20:32:14.1502857Z contiguous=True, 2025-05-07T20:32:14.1503166Z compiled=True, 2025-05-07T20:32:14.1503448Z ) 2025-05-07T20:32:14.1503882Z self = 2025-05-07T20:32:14.1504537Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1504898Z 2025-05-07T20:32:14.1505015Z @given( 2025-05-07T20:32:14.1505344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1505808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1506247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1506710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1507599Z ) 2025-05-07T20:32:14.1508104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1508726Z def test_silu_mul_quant( 2025-05-07T20:32:14.1509055Z self, 2025-05-07T20:32:14.1509326Z T: int, 2025-05-07T20:32:14.1509609Z D: int, 2025-05-07T20:32:14.1509918Z scale_ub: Optional[float], 2025-05-07T20:32:14.1510479Z contiguous: bool, 2025-05-07T20:32:14.1510829Z compiled: bool, 2025-05-07T20:32:14.1511144Z ) -> None: 2025-05-07T20:32:14.1511444Z torch.manual_seed(2025) 2025-05-07T20:32:14.1511775Z 2025-05-07T20:32:14.1512150Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1512621Z 2025-05-07T20:32:14.1512880Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1513289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1513713Z x = x_sign * x_clamp 2025-05-07T20:32:14.1514041Z x0 = x[:, :D] 2025-05-07T20:32:14.1514344Z x1 = x[:, D:] 2025-05-07T20:32:14.1514705Z 2025-05-07T20:32:14.1514969Z if contiguous: 2025-05-07T20:32:14.1515301Z x0 = x0.contiguous() 
2025-05-07T20:32:14.1515669Z x1 = x1.contiguous() 2025-05-07T20:32:14.1516018Z 2025-05-07T20:32:14.1516306Z if scale_ub is not None: 2025-05-07T20:32:14.1516700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1517184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1517630Z ) 2025-05-07T20:32:14.1517907Z else: 2025-05-07T20:32:14.1518195Z scale_ub_tensor = None 2025-05-07T20:32:14.1518553Z 2025-05-07T20:32:14.1518884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1519332Z op = silu_mul_quant 2025-05-07T20:32:14.1519688Z if compiled: 2025-05-07T20:32:14.1520046Z op = torch.compile(op) 2025-05-07T20:32:14.1520467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1520854Z 2025-05-07T20:32:14.1521133Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1521539Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1521948Z 2025-05-07T20:32:14.1522297Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1522819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1523247Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1523705Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1524188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1524647Z 2025-05-07T20:32:14.1524945Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1525220Z 2025-05-07T20:32:14.1525369Z moe/activation_test.py:126: 2025-05-07T20:32:14.1525885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1526368Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1526836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1527945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1528999Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1529763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1530705Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1531672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1532791Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1533901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1534976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1536017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1536940Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1537807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1538704Z fn() 2025-05-07T20:32:14.1539426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1540257Z self.fn.run( 2025-05-07T20:32:14.1540920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1541679Z kernel = self.compile( 2025-05-07T20:32:14.1542488Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1543498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1544039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1544373Z 2025-05-07T20:32:14.1544659Z self = 2025-05-07T20:32:14.1546178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1548142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc8283400>} 2025-05-07T20:32:14.1550049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1551500Z context = 2025-05-07T20:32:14.1551904Z 2025-05-07T20:32:14.1552141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1552936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1553598Z module_map=module_map) 2025-05-07T20:32:14.1554106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1554622Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1555003Z E ^ 2025-05-07T20:32:14.1556139Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1556809Z 2025-05-07T20:32:14.1557652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1558393Z 2025-05-07T20:32:14.1558548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1559130Z self=, 2025-05-07T20:32:14.1559677Z T=2048, 2025-05-07T20:32:14.1559941Z D=5120, 2025-05-07T20:32:14.1560205Z scale_ub=1200.0, 2025-05-07T20:32:14.1560530Z contiguous=True, 2025-05-07T20:32:14.1560867Z compiled=False, 2025-05-07T20:32:14.1561153Z ) 2025-05-07T20:32:14.1561600Z self = 2025-05-07T20:32:14.1562311Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.1562720Z 2025-05-07T20:32:14.1562832Z @given( 2025-05-07T20:32:14.1563172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1563625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1564086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1564557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1565028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1565418Z ) 2025-05-07T20:32:14.1565920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1566550Z def test_silu_mul_quant( 2025-05-07T20:32:14.1566899Z self, 2025-05-07T20:32:14.1567188Z T: int, 2025-05-07T20:32:14.1567565Z D: int, 2025-05-07T20:32:14.1567878Z scale_ub: Optional[float], 2025-05-07T20:32:14.1568275Z contiguous: bool, 2025-05-07T20:32:14.1568630Z compiled: bool, 2025-05-07T20:32:14.1568951Z ) -> None: 2025-05-07T20:32:14.1569252Z torch.manual_seed(2025) 2025-05-07T20:32:14.1569591Z 2025-05-07T20:32:14.1569977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1570465Z 2025-05-07T20:32:14.1570738Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1571154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1571700Z x = x_sign * x_clamp 2025-05-07T20:32:14.1572041Z x0 = x[:, :D] 
2025-05-07T20:32:14.1572326Z x1 = x[:, D:] 2025-05-07T20:32:14.1572622Z 2025-05-07T20:32:14.1572893Z if contiguous: 2025-05-07T20:32:14.1573206Z x0 = x0.contiguous() 2025-05-07T20:32:14.1573562Z x1 = x1.contiguous() 2025-05-07T20:32:14.1573920Z 2025-05-07T20:32:14.1574192Z if scale_ub is not None: 2025-05-07T20:32:14.1574581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1575057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1575504Z ) 2025-05-07T20:32:14.1575783Z else: 2025-05-07T20:32:14.1576091Z scale_ub_tensor = None 2025-05-07T20:32:14.1576468Z 2025-05-07T20:32:14.1576804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1577266Z op = silu_mul_quant 2025-05-07T20:32:14.1577641Z if compiled: 2025-05-07T20:32:14.1578098Z op = torch.compile(op) 2025-05-07T20:32:14.1578550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1578958Z 2025-05-07T20:32:14.1579225Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1579464Z 2025-05-07T20:32:14.1579607Z moe/activation_test.py:117: 2025-05-07T20:32:14.1580022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1580493Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1580888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1581857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1582814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1583671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1584622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1585558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1586307Z kernel = self.compile( 2025-05-07T20:32:14.1587050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1587968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1588531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1588844Z 2025-05-07T20:32:14.1589132Z self = 2025-05-07T20:32:14.1590623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1592602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7facc3962e60>} 2025-05-07T20:32:14.1594525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1595738Z context = 2025-05-07T20:32:14.1596036Z 2025-05-07T20:32:14.1596210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1596827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1597455Z module_map=module_map) 2025-05-07T20:32:14.1597983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1598458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1598819Z E ^ 2025-05-07T20:32:14.1599387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1599929Z 2025-05-07T20:32:14.1600352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1600879Z 2025-05-07T20:32:14.1600983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1601407Z self=, 2025-05-07T20:32:14.1601811Z T=2048, 2025-05-07T20:32:14.1601995Z D=5120, 2025-05-07T20:32:14.1602188Z scale_ub=1200.0, 2025-05-07T20:32:14.1602415Z contiguous=True, 2025-05-07T20:32:14.1602632Z compiled=True, 2025-05-07T20:32:14.1602840Z ) 2025-05-07T20:32:14.1603174Z self = 2025-05-07T20:32:14.1603688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1603960Z 2025-05-07T20:32:14.1604041Z @given( 2025-05-07T20:32:14.1604275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1604594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1604903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1605239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1605575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1605867Z ) 2025-05-07T20:32:14.1606216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1606662Z def test_silu_mul_quant( 2025-05-07T20:32:14.1606911Z self, 2025-05-07T20:32:14.1607103Z T: int, 2025-05-07T20:32:14.1607306Z D: int, 2025-05-07T20:32:14.1607530Z scale_ub: Optional[float], 2025-05-07T20:32:14.1607914Z contiguous: bool, 2025-05-07T20:32:14.1608162Z compiled: bool, 2025-05-07T20:32:14.1608392Z ) -> None: 2025-05-07T20:32:14.1608606Z torch.manual_seed(2025) 2025-05-07T20:32:14.1608856Z 2025-05-07T20:32:14.1609135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1609471Z 2025-05-07T20:32:14.1609664Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1609960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1610266Z x = x_sign * x_clamp 2025-05-07T20:32:14.1610510Z x0 = x[:, :D] 2025-05-07T20:32:14.1610729Z x1 = x[:, D:] 2025-05-07T20:32:14.1610940Z 2025-05-07T20:32:14.1611122Z if contiguous: 2025-05-07T20:32:14.1611356Z x0 = x0.contiguous() 2025-05-07T20:32:14.1611619Z x1 = x1.contiguous() 2025-05-07T20:32:14.1611855Z 2025-05-07T20:32:14.1612050Z if scale_ub is not None: 2025-05-07T20:32:14.1612334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1612722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1613034Z ) 2025-05-07T20:32:14.1613230Z else: 2025-05-07T20:32:14.1613441Z scale_ub_tensor = None 2025-05-07T20:32:14.1613696Z 2025-05-07T20:32:14.1613931Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1614242Z op = silu_mul_quant 2025-05-07T20:32:14.1614495Z if compiled: 2025-05-07T20:32:14.1614748Z op = torch.compile(op) 2025-05-07T20:32:14.1615095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1615374Z 2025-05-07T20:32:14.1615572Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1615863Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1616148Z 2025-05-07T20:32:14.1616392Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1616733Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1617029Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1617349Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1617717Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1618189Z 2025-05-07T20:32:14.1618399Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1618595Z 2025-05-07T20:32:14.1618702Z moe/activation_test.py:126: 2025-05-07T20:32:14.1618993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1619334Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1619671Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1620468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1621231Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1621792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1622488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1623191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1623920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1624682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1625441Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1626181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1626823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1627514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1628042Z fn() 2025-05-07T20:32:14.1628553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1629144Z self.fn.run( 2025-05-07T20:32:14.1629625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1630166Z kernel = self.compile( 2025-05-07T20:32:14.1630710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1631374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1631775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1632003Z 2025-05-07T20:32:14.1632215Z self = 2025-05-07T20:32:14.1633318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1634723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7facc243d6c0>}
2025-05-07T20:32:14.1636089Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:14.1637774Z context = 
2025-05-07T20:32:14.1638235Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.1638766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.1639246Z module_map=module_map)
2025-05-07T20:32:14.1639613Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1640023Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1640294Z E ^
2025-05-07T20:32:14.1640769Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1641649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1642283Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1642752Z self=,
2025-05-07T20:32:14.1643156Z T=16384,
2025-05-07T20:32:14.1643348Z D=7168,
2025-05-07T20:32:14.1643543Z scale_ub=1200.0,
2025-05-07T20:32:14.1643770Z contiguous=False,
2025-05-07T20:32:14.1643993Z compiled=False,
2025-05-07T20:32:14.1644206Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1657303Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1657575Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1671550Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1671908Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1672173Z E ^
2025-05-07T20:32:14.1672699Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1673580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1674211Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1674635Z self=,
2025-05-07T20:32:14.1675041Z T=1,
2025-05-07T20:32:14.1675227Z D=7168,
2025-05-07T20:32:14.1675427Z scale_ub=None,
2025-05-07T20:32:14.1675645Z contiguous=True,
2025-05-07T20:32:14.1675867Z compiled=True,
2025-05-07T20:32:14.1676076Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1699750Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1700068Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1721133Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1721502Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1721771Z E ^
2025-05-07T20:32:14.1722245Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1723141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
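Every failure in this run reduces to the same root cause: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU. The linux.g5 runners carry an NVIDIA A10G (compute capability 8.6), while Triton's fp8e4nv conversion is only available on compute capability 8.9 and newer; older parts get only fp8e4b15 and fp8e5, exactly as the ValueError reports. A capability guard along these lines would let the suite skip cleanly instead of erroring (a minimal sketch, not FBGEMM's actual test scaffolding; the class name is illustrative):

    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not fp8_e4m3_supported(), "fp8e4nv requires compute capability 8.9+")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical wrapper class
        ...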
2025-05-07T20:32:14.1723780Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1724197Z self=,
2025-05-07T20:32:14.1724617Z T=4096,
2025-05-07T20:32:14.1724815Z D=5120,
2025-05-07T20:32:14.1725010Z scale_ub=None,
2025-05-07T20:32:14.1725234Z contiguous=False,
2025-05-07T20:32:14.1725471Z compiled=False,
2025-05-07T20:32:14.1725684Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1738624Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1738898Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1752762Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1753124Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1753389Z E ^
2025-05-07T20:32:14.1753865Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1754762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
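The Hypothesis loop is not needed to trigger the error; any call that reaches one of the two Triton kernels reproduces it. A standalone repro sketch, using the import path shown in the tracebacks above (the tensor shape is arbitrary, and passing None for the scale upper bound mirrors what the test does when scale_ub is None):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(16, 5120, device="cuda", dtype=torch.float32)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)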
2025-05-07T20:32:14.1755394Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1756118Z self=,
2025-05-07T20:32:14.1756532Z T=4096,
2025-05-07T20:32:14.1756728Z D=7168,
2025-05-07T20:32:14.1756922Z scale_ub=None,
2025-05-07T20:32:14.1757148Z contiguous=False,
2025-05-07T20:32:14.1757382Z compiled=False,
2025-05-07T20:32:14.1757593Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1770678Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1770954Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1784906Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1785272Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1785543Z E ^
2025-05-07T20:32:14.1786021Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1786498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1786610Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1786846Z self=,
2025-05-07T20:32:14.1786926Z T=128,
2025-05-07T20:32:14.1787005Z D=7168,
2025-05-07T20:32:14.1787105Z scale_ub=None,
2025-05-07T20:32:14.1787195Z contiguous=False,
2025-05-07T20:32:14.1787288Z compiled=True,
2025-05-07T20:32:14.1787361Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1793432Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1793582Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1802881Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1802989Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1803073Z E ^
2025-05-07T20:32:14.1803583Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1804027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1804148Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1804451Z self=,
2025-05-07T20:32:14.1804541Z T=128,
2025-05-07T20:32:14.1804619Z D=7168,
2025-05-07T20:32:14.1804704Z scale_ub=None,
2025-05-07T20:32:14.1804862Z contiguous=False,
2025-05-07T20:32:14.1804948Z compiled=False,
2025-05-07T20:32:14.1805030Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1810055Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1810161Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1816259Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1816362Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1816442Z E ^
2025-05-07T20:32:14.1816813Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1817364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
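Each verbose retry above re-runs the same randomized search. To replay one failing parameter set deterministically while debugging, Hypothesis' standard @example decorator can pin it ahead of the random draws (a sketch; the strategies are copied from the test shown above, the pinned values are taken from this log, and the method body is elided):

    from hypothesis import example, given, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...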
2025-05-07T20:32:14.1817490Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1817722Z self=,
2025-05-07T20:32:14.1817803Z T=4096,
2025-05-07T20:32:14.1817891Z D=5120,
2025-05-07T20:32:14.1817979Z scale_ub=1200.0,
2025-05-07T20:32:14.1818178Z contiguous=True,
2025-05-07T20:32:14.1818277Z compiled=False,
2025-05-07T20:32:14.1818353Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1823404Z > y_fp8, y_scale = fn()
2025-05-07T20:32:14.1823508Z moe/activation_test.py:117:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1829624Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1829736Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.1829815Z E ^
2025-05-07T20:32:14.1830221Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1830655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1830772Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1831003Z self=,
2025-05-07T20:32:14.1831092Z T=1,
2025-05-07T20:32:14.1831172Z D=5120,
2025-05-07T20:32:14.1831263Z scale_ub=None,
2025-05-07T20:32:14.1831352Z contiguous=True,
2025-05-07T20:32:14.1831438Z compiled=True,
2025-05-07T20:32:14.1831519Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1843741Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1843856Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1853032Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1853182Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1853261Z E ^
2025-05-07T20:32:14.1853636Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1854069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1854197Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:14.1854428Z self=,
2025-05-07T20:32:14.1854510Z T=2048,
2025-05-07T20:32:14.1854600Z D=5120,
2025-05-07T20:32:14.1854688Z scale_ub=None,
2025-05-07T20:32:14.1854778Z contiguous=True,
2025-05-07T20:32:14.1854876Z compiled=True,
2025-05-07T20:32:14.1854954Z )
[test source identical to the listing above; elided]
2025-05-07T20:32:14.1861720Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:14.1861835Z moe/activation_test.py:126:
[traceback identical to the one above; elided]
2025-05-07T20:32:14.1871037Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1871145Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1871230Z E ^
2025-05-07T20:32:14.1871596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1872028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
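One consistent detail across these retries: examples with compiled=False fail inside fn() at moe/activation_test.py:117 (the _fbgemm_silu_mul_quant kernel), while compiled=True examples get past fn() and fail in ref_fn() at line 126 (_kernel_quantize_fp8_row), suggesting torch.compile generates its own quantization code rather than invoking the raw Triton kernel. The reference math itself needs no Triton at all; a plain-PyTorch version of what ref_fn computes might look like the following (illustrative only, assuming torch.float8_e4m3fn is available in this PyTorch build; this is not FBGEMM's API):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        # SiLU(x0) * x1, computed in fp32 as in ref_fn
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=float(scale_ub))
        # Per-row dequantization scale, so that y ~= y_fp8.float() * scale[:, None]
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale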
2025-05-07T20:32:14.1860334Z op = torch.compile(op) 2025-05-07T20:32:14.1860443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1860524Z 2025-05-07T20:32:14.1860619Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1860745Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1860825Z 2025-05-07T20:32:14.1860974Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1861086Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1861189Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1861382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1861538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1861614Z 2025-05-07T20:32:14.1861720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1861725Z 2025-05-07T20:32:14.1861835Z moe/activation_test.py:126: 2025-05-07T20:32:14.1861975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1862092Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1862233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1862843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1862981Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1863351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1863584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1863970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1864233Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1864651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1864909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1865294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1865554Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1865908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1865997Z fn() 2025-05-07T20:32:14.1866409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1866495Z self.fn.run( 2025-05-07T20:32:14.1866849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1866956Z kernel = self.compile( 2025-05-07T20:32:14.1867347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1867534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1867667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1867671Z 2025-05-07T20:32:14.1867901Z self = 2025-05-07T20:32:14.1868703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
2025-05-07T20:32:14.1869224Z codegen_fns = {'convert_custom_types': <function ... at 0x...>, 'min_dot_size': <function ... at 0x7fac9d8c2d40>}
2025-05-07T20:32:14.1870043Z module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
2025-05-07T20:32:14.1870245Z context = <...>
2025-05-07T20:32:14.1870250Z
2025-05-07T20:32:14.1870428Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.1870710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.1870822Z module_map=module_map)
2025-05-07T20:32:14.1871037Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.1871145Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:14.1871230Z E ^
2025-05-07T20:32:14.1871596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.1871603Z
2025-05-07T20:32:14.1872028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.1872033Z
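Every example in this run fails at Triton compile time, before any kernel executes: both _fbgemm_silu_mul_quant and the _kernel_quantize_fp8_row kernel behind triton_quantize_fp8_row emit the fp8e4nv (FP8 E4M3) dtype, which Triton only supports on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The g5.4xlarge runner's A10G reports compute capability 8.6, so every sampled combination of T, D, scale_ub, contiguous, and compiled fails identically. A minimal sketch of a hardware guard for such tests, assuming unittest-style test classes as shown in the log (supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual API):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) codegen needs an NVIDIA GPU with
        # compute capability >= 8.9 (e.g. L4/L40S at 8.9, H100 at 9.0);
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...

For reference, the row-wise quantization that the test's ref_fn delegates to triton_quantize_fp8_row can be approximated in plain PyTorch. This is an illustrative sketch consistent with the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), assuming E4M3 with a saturating clamp; the actual Triton kernel's details may differ:

    from typing import Optional

    import torch


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ):
        # One dequantization scale per row: y ~= y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Cap the per-row maximum, as the scale_ub=1200.0 examples do.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Each Hypothesis example that follows fails with this same CompilationError; only the sampled parameters and the kernel that compiles first differ: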
2025-05-07T20:32:14.1877484Z op = silu_mul_quant 2025-05-07T20:32:14.1877572Z if compiled: 2025-05-07T20:32:14.1877685Z op = torch.compile(op) 2025-05-07T20:32:14.1877798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1877873Z 2025-05-07T20:32:14.1877974Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1878107Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1878182Z 2025-05-07T20:32:14.1878331Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1878507Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1878619Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1878749Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1878896Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1878978Z 2025-05-07T20:32:14.1879087Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1879092Z 2025-05-07T20:32:14.1879196Z moe/activation_test.py:126: 2025-05-07T20:32:14.1879332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1879442Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1879590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1880169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1880276Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1880658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1880886Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1881263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1881536Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1881946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1882213Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1882725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1882903Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1883266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1883347Z fn() 2025-05-07T20:32:14.1883765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1883851Z self.fn.run( 2025-05-07T20:32:14.1884203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1884310Z kernel = self.compile( 2025-05-07T20:32:14.1884701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1884881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1885025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1885029Z 2025-05-07T20:32:14.1885242Z self = 2025-05-07T20:32:14.1886054Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1886576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c2e60>} 2025-05-07T20:32:14.1887391Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1887589Z context = 2025-05-07T20:32:14.1887600Z 2025-05-07T20:32:14.1887771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1888053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1888208Z module_map=module_map) 2025-05-07T20:32:14.1888377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1888490Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1888569Z E ^ 2025-05-07T20:32:14.1888944Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1888949Z 2025-05-07T20:32:14.1889376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1889380Z 2025-05-07T20:32:14.1889489Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1889729Z self=, 2025-05-07T20:32:14.1889809Z T=4096, 2025-05-07T20:32:14.1889894Z D=5120, 2025-05-07T20:32:14.1889979Z scale_ub=None, 2025-05-07T20:32:14.1890071Z contiguous=True, 2025-05-07T20:32:14.1890163Z compiled=True, 2025-05-07T20:32:14.1890239Z ) 2025-05-07T20:32:14.1890465Z self = 2025-05-07T20:32:14.1890648Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1890653Z 2025-05-07T20:32:14.1890737Z @given( 2025-05-07T20:32:14.1890863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1890974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1891095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1891224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1891343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1891420Z ) 2025-05-07T20:32:14.1891778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1891879Z def test_silu_mul_quant( 2025-05-07T20:32:14.1891960Z self, 2025-05-07T20:32:14.1892048Z T: int, 2025-05-07T20:32:14.1892126Z D: int, 2025-05-07T20:32:14.1892230Z scale_ub: Optional[float], 2025-05-07T20:32:14.1892329Z contiguous: bool, 2025-05-07T20:32:14.1892418Z compiled: bool, 2025-05-07T20:32:14.1892500Z ) -> None: 2025-05-07T20:32:14.1892606Z torch.manual_seed(2025) 2025-05-07T20:32:14.1892686Z 2025-05-07T20:32:14.1892900Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1892992Z 2025-05-07T20:32:14.1893089Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1893225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1893318Z x = x_sign * x_clamp 2025-05-07T20:32:14.1893399Z x0 = x[:, :D] 2025-05-07T20:32:14.1893486Z x1 = x[:, D:] 2025-05-07T20:32:14.1893566Z 2025-05-07T20:32:14.1893654Z if contiguous: 2025-05-07T20:32:14.1893754Z x0 = x0.contiguous() 2025-05-07T20:32:14.1893848Z x1 = x1.contiguous() 2025-05-07T20:32:14.1893923Z 2025-05-07T20:32:14.1894025Z if scale_ub is not None: 2025-05-07T20:32:14.1894135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1894275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1894359Z ) 2025-05-07T20:32:14.1894509Z else: 2025-05-07T20:32:14.1894617Z scale_ub_tensor 
= None 2025-05-07T20:32:14.1894693Z 2025-05-07T20:32:14.1894839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1894975Z op = silu_mul_quant 2025-05-07T20:32:14.1895101Z if compiled: 2025-05-07T20:32:14.1895241Z op = torch.compile(op) 2025-05-07T20:32:14.1895375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1895449Z 2025-05-07T20:32:14.1895550Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1895683Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1895812Z 2025-05-07T20:32:14.1895954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1896065Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1896166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1896297Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1896442Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1896520Z 2025-05-07T20:32:14.1896628Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1896633Z 2025-05-07T20:32:14.1896734Z moe/activation_test.py:126: 2025-05-07T20:32:14.1896863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1896979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1897122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1897704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1897811Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1898260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1898495Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1898876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1899138Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1899554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1899895Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1900295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1900470Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1900822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1900906Z fn() 2025-05-07T20:32:14.1901317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1901408Z self.fn.run( 2025-05-07T20:32:14.1901756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1901851Z kernel = self.compile( 2025-05-07T20:32:14.1902245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1902430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1902573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1902581Z 2025-05-07T20:32:14.1902794Z self = 2025-05-07T20:32:14.1903710Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1904293Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d456320>} 2025-05-07T20:32:14.1905144Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1905360Z context = 2025-05-07T20:32:14.1905365Z 2025-05-07T20:32:14.1905535Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1905954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1906107Z module_map=module_map) 2025-05-07T20:32:14.1906319Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1906480Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1906585Z E ^ 2025-05-07T20:32:14.1907104Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1907121Z 2025-05-07T20:32:14.1907555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1907560Z 2025-05-07T20:32:14.1907669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1907905Z self=, 2025-05-07T20:32:14.1907983Z T=16384, 2025-05-07T20:32:14.1908066Z D=5120, 2025-05-07T20:32:14.1908154Z scale_ub=None, 2025-05-07T20:32:14.1908240Z contiguous=True, 2025-05-07T20:32:14.1908324Z compiled=True, 2025-05-07T20:32:14.1908405Z ) 2025-05-07T20:32:14.1908629Z self = 2025-05-07T20:32:14.1908817Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.1908821Z 2025-05-07T20:32:14.1908903Z @given( 2025-05-07T20:32:14.1909026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1909135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1909252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1909370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1909595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1909670Z ) 2025-05-07T20:32:14.1909923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1910026Z def test_silu_mul_quant( 2025-05-07T20:32:14.1910102Z self, 2025-05-07T20:32:14.1910184Z T: int, 2025-05-07T20:32:14.1910260Z D: int, 2025-05-07T20:32:14.1910360Z scale_ub: Optional[float], 2025-05-07T20:32:14.1910455Z contiguous: bool, 2025-05-07T20:32:14.1910544Z compiled: bool, 2025-05-07T20:32:14.1910623Z ) -> None: 2025-05-07T20:32:14.1910723Z torch.manual_seed(2025) 2025-05-07T20:32:14.1910798Z 2025-05-07T20:32:14.1910969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1911048Z 2025-05-07T20:32:14.1911141Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1911271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1911370Z x = x_sign * x_clamp 2025-05-07T20:32:14.1911450Z x0 = x[:, :D] 2025-05-07T20:32:14.1911537Z x1 = x[:, D:] 2025-05-07T20:32:14.1911609Z 2025-05-07T20:32:14.1911696Z if contiguous: 2025-05-07T20:32:14.1911792Z x0 = x0.contiguous() 2025-05-07T20:32:14.1911884Z x1 = x1.contiguous() 2025-05-07T20:32:14.1911956Z 2025-05-07T20:32:14.1912053Z if scale_ub is not None: 2025-05-07T20:32:14.1912164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1912315Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:14.1912491Z ) 2025-05-07T20:32:14.1912583Z else: 2025-05-07T20:32:14.1912678Z scale_ub_tensor = None 2025-05-07T20:32:14.1912756Z 2025-05-07T20:32:14.1912888Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1912981Z op = silu_mul_quant 2025-05-07T20:32:14.1913074Z if compiled: 2025-05-07T20:32:14.1913181Z op = torch.compile(op) 2025-05-07T20:32:14.1913294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1913367Z 2025-05-07T20:32:14.1913460Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1913634Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1913708Z 2025-05-07T20:32:14.1913848Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1913957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1914057Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1914183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1914332Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1914407Z 2025-05-07T20:32:14.1914519Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.1914524Z 2025-05-07T20:32:14.1914624Z moe/activation_test.py:126: 2025-05-07T20:32:14.1914752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1914871Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1915012Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1915593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1915706Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1916073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1916310Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1916686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1916947Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1917442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1917701Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1918097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1918270Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1918622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1918708Z fn() 2025-05-07T20:32:14.1919119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1919203Z self.fn.run( 2025-05-07T20:32:14.1919557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1919653Z kernel = self.compile( 2025-05-07T20:32:14.1920063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1920242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1920374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:14.1920379Z 2025-05-07T20:32:14.1920597Z self = 2025-05-07T20:32:14.1921398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1921965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9d8c25f0>} 2025-05-07T20:32:14.1922791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1922987Z context = 2025-05-07T20:32:14.1923032Z 2025-05-07T20:32:14.1923207Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1923479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1923595Z module_map=module_map) 2025-05-07T20:32:14.1923761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1923865Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1923949Z E ^ 2025-05-07T20:32:14.1924314Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1924318Z 2025-05-07T20:32:14.1924754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1924759Z 2025-05-07T20:32:14.1924865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1925097Z self=, 2025-05-07T20:32:14.1925183Z T=1, 2025-05-07T20:32:14.1925261Z D=5120, 2025-05-07T20:32:14.1925345Z scale_ub=1200.0, 2025-05-07T20:32:14.1925437Z contiguous=True, 2025-05-07T20:32:14.1925521Z compiled=True, 2025-05-07T20:32:14.1925594Z ) 2025-05-07T20:32:14.1925826Z self = 2025-05-07T20:32:14.1925997Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.1926002Z 2025-05-07T20:32:14.1926086Z @given( 2025-05-07T20:32:14.1926207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1926310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1926439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1926636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1926754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1926837Z ) 2025-05-07T20:32:14.1927090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1927185Z def test_silu_mul_quant( 2025-05-07T20:32:14.1927267Z self, 2025-05-07T20:32:14.1927343Z T: int, 2025-05-07T20:32:14.1927423Z D: int, 2025-05-07T20:32:14.1927524Z scale_ub: Optional[float], 2025-05-07T20:32:14.1927617Z contiguous: bool, 2025-05-07T20:32:14.1927709Z compiled: bool, 2025-05-07T20:32:14.1927788Z ) -> None: 2025-05-07T20:32:14.1927884Z torch.manual_seed(2025) 2025-05-07T20:32:14.1927962Z 2025-05-07T20:32:14.1928133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1928207Z 2025-05-07T20:32:14.1928310Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1928442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1928530Z x = x_sign * x_clamp 2025-05-07T20:32:14.1928616Z x0 = x[:, :D] 2025-05-07T20:32:14.1928698Z x1 = x[:, D:] 2025-05-07T20:32:14.1928770Z 2025-05-07T20:32:14.1928864Z if contiguous: 2025-05-07T20:32:14.1928958Z x0 = x0.contiguous() 2025-05-07T20:32:14.1929052Z x1 = x1.contiguous() 2025-05-07T20:32:14.1929124Z 2025-05-07T20:32:14.1929216Z if scale_ub is not None: 2025-05-07T20:32:14.1929375Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:14.1929512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1929588Z ) 2025-05-07T20:32:14.1929669Z else: 2025-05-07T20:32:14.1929764Z scale_ub_tensor = None 2025-05-07T20:32:14.1929837Z 2025-05-07T20:32:14.1929977Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1930069Z op = silu_mul_quant 2025-05-07T20:32:14.1930160Z if compiled: 2025-05-07T20:32:14.1930268Z op = torch.compile(op) 2025-05-07T20:32:14.1930375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1930499Z 2025-05-07T20:32:14.1930590Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1930595Z 2025-05-07T20:32:14.1930692Z moe/activation_test.py:117: 2025-05-07T20:32:14.1930829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1930932Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1931037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1931427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1931521Z return fn(*args, **kwargs) 2025-05-07T20:32:14.1932037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1932142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1932537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1932801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1933151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1933246Z kernel = self.compile( 2025-05-07T20:32:14.1933645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1933826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1933958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1933963Z 2025-05-07T20:32:14.1934173Z self = 2025-05-07T20:32:14.1935048Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1935580Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff05e0>} 2025-05-07T20:32:14.1936343Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1936547Z context = 2025-05-07T20:32:14.1936552Z 2025-05-07T20:32:14.1936721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1936997Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1937112Z module_map=module_map) 2025-05-07T20:32:14.1937277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1937388Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1937468Z E ^ 2025-05-07T20:32:14.1937830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1937835Z 2025-05-07T20:32:14.1938383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1938440Z 2025-05-07T20:32:14.1938550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1938787Z self=, 2025-05-07T20:32:14.1938866Z T=1, 2025-05-07T20:32:14.1938945Z D=5120, 2025-05-07T20:32:14.1939035Z scale_ub=None, 2025-05-07T20:32:14.1939125Z contiguous=False, 2025-05-07T20:32:14.1939210Z compiled=True, 2025-05-07T20:32:14.1939297Z ) 2025-05-07T20:32:14.1939521Z self = 2025-05-07T20:32:14.1939689Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.1939742Z 2025-05-07T20:32:14.1939819Z @given( 2025-05-07T20:32:14.1939941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1940048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1940167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1940295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1940417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1940492Z ) 2025-05-07T20:32:14.1940743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1940843Z def test_silu_mul_quant( 2025-05-07T20:32:14.1940920Z self, 2025-05-07T20:32:14.1940997Z T: int, 2025-05-07T20:32:14.1941084Z D: int, 2025-05-07T20:32:14.1941184Z scale_ub: Optional[float], 2025-05-07T20:32:14.1941282Z contiguous: bool, 2025-05-07T20:32:14.1941370Z compiled: bool, 2025-05-07T20:32:14.1941452Z ) -> None: 2025-05-07T20:32:14.1941560Z torch.manual_seed(2025) 2025-05-07T20:32:14.1941633Z 2025-05-07T20:32:14.1941806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1941887Z 2025-05-07T20:32:14.1941979Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1942109Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1942207Z x = x_sign * x_clamp 2025-05-07T20:32:14.1942288Z x0 = x[:, :D] 2025-05-07T20:32:14.1942374Z x1 = x[:, D:] 2025-05-07T20:32:14.1942471Z 2025-05-07T20:32:14.1942559Z if contiguous: 2025-05-07T20:32:14.1942669Z x0 = x0.contiguous() 2025-05-07T20:32:14.1942766Z x1 = x1.contiguous() 2025-05-07T20:32:14.1942837Z 2025-05-07T20:32:14.1943039Z if scale_ub is not None: 2025-05-07T20:32:14.1943147Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1943284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1943369Z ) 2025-05-07T20:32:14.1943445Z else: 2025-05-07T20:32:14.1943539Z scale_ub_tensor = None 2025-05-07T20:32:14.1943618Z 2025-05-07T20:32:14.1943749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1943839Z op = silu_mul_quant 2025-05-07T20:32:14.1943933Z if compiled: 2025-05-07T20:32:14.1944034Z op = torch.compile(op) 2025-05-07T20:32:14.1944141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1944220Z 2025-05-07T20:32:14.1944311Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.1944440Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.1944514Z 2025-05-07T20:32:14.1944657Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1944766Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.1944868Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.1944995Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.1945144Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1945217Z 2025-05-07T20:32:14.1945319Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:14.1945324Z 2025-05-07T20:32:14.1945429Z moe/activation_test.py:126: 2025-05-07T20:32:14.1945604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1945717Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.1945854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.1946428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.1946542Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.1946908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1947175Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1947557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.1947818Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1948235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:14.1948492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.1948873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.1949054Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.1949403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.1949494Z fn() 2025-05-07T20:32:14.1949903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.1949985Z self.fn.run( 2025-05-07T20:32:14.1950337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1950434Z kernel = self.compile( 2025-05-07T20:32:14.1950822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1951007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1951136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1951140Z 2025-05-07T20:32:14.1951436Z self = 2025-05-07T20:32:14.1952237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1952800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fac9cff1090>} 2025-05-07T20:32:14.1953586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1953784Z context = 2025-05-07T20:32:14.1953789Z 2025-05-07T20:32:14.1953963Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1954239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1954355Z module_map=module_map) 2025-05-07T20:32:14.1954522Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1954627Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.1954710Z E ^ 2025-05-07T20:32:14.1955073Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1955120Z 2025-05-07T20:32:14.1955961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1955981Z 2025-05-07T20:32:14.1956097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1956329Z self=, 2025-05-07T20:32:14.1956415Z T=1, 2025-05-07T20:32:14.1956495Z D=5120, 2025-05-07T20:32:14.1956585Z scale_ub=None, 2025-05-07T20:32:14.1956677Z contiguous=True, 2025-05-07T20:32:14.1956763Z compiled=False, 2025-05-07T20:32:14.1956836Z ) 2025-05-07T20:32:14.1957230Z self = 2025-05-07T20:32:14.1957398Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.1957402Z 2025-05-07T20:32:14.1957480Z @given( 2025-05-07T20:32:14.1957610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1957714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1957839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1957962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1958079Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1958161Z ) 2025-05-07T20:32:14.1958414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1958507Z def test_silu_mul_quant( 2025-05-07T20:32:14.1958597Z self, 2025-05-07T20:32:14.1958676Z T: int, 2025-05-07T20:32:14.1958752Z D: int, 2025-05-07T20:32:14.1958858Z scale_ub: Optional[float], 2025-05-07T20:32:14.1958952Z contiguous: bool, 2025-05-07T20:32:14.1959047Z compiled: bool, 2025-05-07T20:32:14.1959125Z ) -> None: 2025-05-07T20:32:14.1959222Z torch.manual_seed(2025) 2025-05-07T20:32:14.1959299Z 2025-05-07T20:32:14.1959473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1959547Z 2025-05-07T20:32:14.1959646Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1959774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1959865Z x = x_sign * x_clamp 2025-05-07T20:32:14.1959950Z x0 = x[:, :D] 2025-05-07T20:32:14.1960029Z x1 = x[:, D:] 2025-05-07T20:32:14.1960100Z 2025-05-07T20:32:14.1960191Z if contiguous: 2025-05-07T20:32:14.1960283Z x0 = x0.contiguous() 2025-05-07T20:32:14.1960583Z x1 = x1.contiguous() 2025-05-07T20:32:14.1960665Z 2025-05-07T20:32:14.1960758Z if scale_ub is not None: 2025-05-07T20:32:14.1960875Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1961012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1961087Z ) 2025-05-07T20:32:14.1961169Z else: 2025-05-07T20:32:14.1961264Z scale_ub_tensor = None 2025-05-07T20:32:14.1961337Z 2025-05-07T20:32:14.1961480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1961571Z op = silu_mul_quant 2025-05-07T20:32:14.1961656Z if compiled: 2025-05-07T20:32:14.1961764Z 
op = torch.compile(op) 2025-05-07T20:32:14.1961872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1961944Z 2025-05-07T20:32:14.1962041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1962045Z 2025-05-07T20:32:14.1962150Z moe/activation_test.py:117: 2025-05-07T20:32:14.1962287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1962389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1962493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1963016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1963116Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1963484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1963781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1964131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1964232Z kernel = self.compile( 2025-05-07T20:32:14.1964629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1964808Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1964982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1964987Z 2025-05-07T20:32:14.1965198Z self = 2025-05-07T20:32:14.1966002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1966520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3880>} 2025-05-07T20:32:14.1967292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1967495Z context = 2025-05-07T20:32:14.1967502Z 2025-05-07T20:32:14.1967670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1967952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1968061Z module_map=module_map) 2025-05-07T20:32:14.1968227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1968334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1968408Z E ^ 2025-05-07T20:32:14.1968788Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1968792Z 2025-05-07T20:32:14.1969294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1969299Z 2025-05-07T20:32:14.1969408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1969643Z self=, 2025-05-07T20:32:14.1975410Z T=128, 2025-05-07T20:32:14.1975520Z D=5120, 2025-05-07T20:32:14.1975613Z scale_ub=None, 2025-05-07T20:32:14.1975702Z contiguous=False, 2025-05-07T20:32:14.1975795Z compiled=True, 2025-05-07T20:32:14.1975870Z ) 2025-05-07T20:32:14.1976104Z self = 2025-05-07T20:32:14.1976293Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.1976299Z 2025-05-07T20:32:14.1976377Z @given( 2025-05-07T20:32:14.1976511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1976613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1976732Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1976864Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1976981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1977055Z ) 2025-05-07T20:32:14.1977318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1977416Z def test_silu_mul_quant( 2025-05-07T20:32:14.1977494Z self, 2025-05-07T20:32:14.1977579Z T: int, 2025-05-07T20:32:14.1977654Z D: int, 2025-05-07T20:32:14.1977755Z scale_ub: Optional[float], 2025-05-07T20:32:14.1977933Z contiguous: bool, 2025-05-07T20:32:14.1978104Z compiled: bool, 2025-05-07T20:32:14.1978224Z ) -> None: 2025-05-07T20:32:14.1978325Z torch.manual_seed(2025) 2025-05-07T20:32:14.1978400Z 2025-05-07T20:32:14.1978583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1978659Z 2025-05-07T20:32:14.1978754Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1978897Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1978988Z x = x_sign * x_clamp 2025-05-07T20:32:14.1979069Z x0 = x[:, :D] 2025-05-07T20:32:14.1979214Z x1 = x[:, D:] 2025-05-07T20:32:14.1979288Z 2025-05-07T20:32:14.1979376Z if contiguous: 2025-05-07T20:32:14.1979479Z x0 = x0.contiguous() 2025-05-07T20:32:14.1979572Z x1 = x1.contiguous() 2025-05-07T20:32:14.1979654Z 2025-05-07T20:32:14.1979747Z if scale_ub is not None: 2025-05-07T20:32:14.1979858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1980011Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1980087Z ) 2025-05-07T20:32:14.1980165Z else: 2025-05-07T20:32:14.1980271Z scale_ub_tensor = None 2025-05-07T20:32:14.1980344Z 2025-05-07T20:32:14.1980480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1980582Z op = silu_mul_quant 2025-05-07T20:32:14.1980674Z if compiled: 2025-05-07T20:32:14.1980777Z op = torch.compile(op) 2025-05-07T20:32:14.1980894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1980973Z 2025-05-07T20:32:14.1981070Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1981082Z 2025-05-07T20:32:14.1981184Z moe/activation_test.py:117: 2025-05-07T20:32:14.1981320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1981436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1981541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1981925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.1982029Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.1982559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1982682Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1983180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1983416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1983774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1983873Z kernel = self.compile( 2025-05-07T20:32:14.1984266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1984458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1984588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1984593Z 2025-05-07T20:32:14.1984813Z self = 2025-05-07T20:32:14.1985621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1986146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff3eb0>} 2025-05-07T20:32:14.1986922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.1987162Z context = 2025-05-07T20:32:14.1987167Z 2025-05-07T20:32:14.1987345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.1987618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.1987735Z module_map=module_map) 2025-05-07T20:32:14.1987915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.1988020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.1988147Z E ^ 2025-05-07T20:32:14.1988513Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.1988518Z 2025-05-07T20:32:14.1988945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.1988952Z 2025-05-07T20:32:14.1989070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.1989301Z self=, 2025-05-07T20:32:14.1989389Z T=128, 2025-05-07T20:32:14.1989467Z D=7168, 2025-05-07T20:32:14.1989552Z scale_ub=1200.0, 2025-05-07T20:32:14.1989649Z contiguous=False, 2025-05-07T20:32:14.1989734Z compiled=False, 2025-05-07T20:32:14.1989809Z ) 2025-05-07T20:32:14.1990045Z self = 2025-05-07T20:32:14.1990224Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.1990232Z 2025-05-07T20:32:14.1990311Z @given( 2025-05-07T20:32:14.1990441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.1990543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.1990668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.1990793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.1990909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.1990989Z ) 2025-05-07T20:32:14.1991241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.1991337Z def test_silu_mul_quant( 2025-05-07T20:32:14.1991422Z self, 2025-05-07T20:32:14.1991500Z T: int, 2025-05-07T20:32:14.1991577Z D: int, 2025-05-07T20:32:14.1991767Z scale_ub: Optional[float], 2025-05-07T20:32:14.1991860Z contiguous: bool, 2025-05-07T20:32:14.1991947Z compiled: bool, 2025-05-07T20:32:14.1992036Z ) -> None: 2025-05-07T20:32:14.1992133Z torch.manual_seed(2025) 2025-05-07T20:32:14.1992211Z 2025-05-07T20:32:14.1992385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.1992461Z 2025-05-07T20:32:14.1992552Z x_sign = torch.sign(x) 2025-05-07T20:32:14.1992689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.1992781Z x = x_sign * x_clamp 2025-05-07T20:32:14.1992863Z x0 = x[:, :D] 2025-05-07T20:32:14.1992951Z x1 = x[:, D:] 2025-05-07T20:32:14.1993024Z 2025-05-07T20:32:14.1993107Z if contiguous: 2025-05-07T20:32:14.1993205Z x0 = x0.contiguous() 2025-05-07T20:32:14.1993296Z x1 = x1.contiguous() 2025-05-07T20:32:14.1993369Z 2025-05-07T20:32:14.1993473Z if scale_ub is not None: 2025-05-07T20:32:14.1993579Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.1993722Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.1993802Z ) 2025-05-07T20:32:14.1993878Z else: 2025-05-07T20:32:14.1993978Z scale_ub_tensor = None 2025-05-07T20:32:14.1994050Z 2025-05-07T20:32:14.1994182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.1994281Z op = silu_mul_quant 2025-05-07T20:32:14.1994413Z if compiled: 2025-05-07T20:32:14.1994516Z op = torch.compile(op) 2025-05-07T20:32:14.1994631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1994703Z 2025-05-07T20:32:14.1994795Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.1994800Z 2025-05-07T20:32:14.1994908Z moe/activation_test.py:117: 2025-05-07T20:32:14.1995040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1995157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.1995260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.1995774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.1995931Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.1996300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.1996528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.1996887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.1996986Z kernel = self.compile( 2025-05-07T20:32:14.1997385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.1997569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.1997697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.1997702Z 2025-05-07T20:32:14.1997924Z self = 2025-05-07T20:32:14.1998726Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.1999254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9cff0d30>} 2025-05-07T20:32:14.2000024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2000316Z context = 2025-05-07T20:32:14.2000322Z 2025-05-07T20:32:14.2000495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2000773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2000892Z module_map=module_map) 2025-05-07T20:32:14.2001059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2001163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2001254Z E ^ 2025-05-07T20:32:14.2001621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2001626Z 2025-05-07T20:32:14.2002061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2002065Z 2025-05-07T20:32:14.2002174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2002434Z self=, 2025-05-07T20:32:14.2002527Z T=128, 2025-05-07T20:32:14.2002626Z D=5120, 2025-05-07T20:32:14.2002717Z scale_ub=None, 2025-05-07T20:32:14.2002816Z contiguous=False, 2025-05-07T20:32:14.2002907Z compiled=False, 2025-05-07T20:32:14.2002983Z ) 2025-05-07T20:32:14.2003214Z self = 2025-05-07T20:32:14.2003391Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2003440Z 2025-05-07T20:32:14.2003530Z @given( 2025-05-07T20:32:14.2003655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2003762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2003889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2004012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2004131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2004224Z ) 2025-05-07T20:32:14.2004478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2004630Z def test_silu_mul_quant( 2025-05-07T20:32:14.2004709Z self, 2025-05-07T20:32:14.2004789Z T: int, 2025-05-07T20:32:14.2004874Z D: int, 2025-05-07T20:32:14.2004978Z scale_ub: Optional[float], 2025-05-07T20:32:14.2005071Z contiguous: bool, 2025-05-07T20:32:14.2005164Z compiled: bool, 2025-05-07T20:32:14.2005247Z ) -> None: 2025-05-07T20:32:14.2005383Z torch.manual_seed(2025) 2025-05-07T20:32:14.2005497Z 2025-05-07T20:32:14.2005741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2005838Z 2025-05-07T20:32:14.2005940Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2006069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2006168Z x = x_sign * x_clamp 2025-05-07T20:32:14.2006257Z x0 = x[:, :D] 2025-05-07T20:32:14.2006341Z x1 = x[:, D:] 2025-05-07T20:32:14.2006423Z 2025-05-07T20:32:14.2006511Z if contiguous: 2025-05-07T20:32:14.2006615Z x0 = x0.contiguous() 2025-05-07T20:32:14.2006716Z x1 = x1.contiguous() 2025-05-07T20:32:14.2006791Z 2025-05-07T20:32:14.2006886Z if scale_ub is not None: 2025-05-07T20:32:14.2007001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2007141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2007224Z ) 2025-05-07T20:32:14.2007311Z else: 2025-05-07T20:32:14.2007426Z scale_ub_tensor = None 2025-05-07T20:32:14.2007535Z 2025-05-07T20:32:14.2007709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2007805Z op = silu_mul_quant 2025-05-07T20:32:14.2007900Z if compiled: 2025-05-07T20:32:14.2008002Z op = torch.compile(op) 2025-05-07T20:32:14.2008212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2008294Z 2025-05-07T20:32:14.2008389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2008393Z 2025-05-07T20:32:14.2008497Z moe/activation_test.py:117: 2025-05-07T20:32:14.2008691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2008832Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2008937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2009538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2009650Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2010029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2010258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2010616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2010722Z kernel = self.compile( 2025-05-07T20:32:14.2011115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2011305Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2011433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2011438Z 2025-05-07T20:32:14.2011652Z self = 2025-05-07T20:32:14.2012519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2013097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fac9c9a6cb0>} 2025-05-07T20:32:14.2013871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2014144Z context = 2025-05-07T20:32:14.2014149Z 2025-05-07T20:32:14.2014324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2014609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2014720Z module_map=module_map) 2025-05-07T20:32:14.2014894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2014995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2015076Z E ^ 2025-05-07T20:32:14.2015452Z E ValueError("type fp8e4nv not supported in this architecture. 
The next Hypothesis examples fail identically: fn() reaches the Triton compile step for _fbgemm_silu_mul_quant (routed through torch/_dynamo/eval_frame.py:678 when compiled=True) and raises the same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
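The common denominator is the GPU, not the parameters: fp8e4nv is Triton's e4m3 float8 type, which Triton only enables on devices with compute capability 8.9 (Ada) or newer, while this job's linux.g5.4xlarge runner carries an NVIDIA A10G at sm_86 (the error's supported list, fp8e4b15 and fp8e5, is what Triton offers there instead). A sketch of a guard the test module could use to skip these cases on such hardware; the helper name and the 8.9 threshold are assumptions, not code from this log:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels need sm_89+ (Ada/Hopper); A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...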
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets further (decorators and setup identical to the first example above): fn() returns, and the failure moves into the reference computation instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
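Here the reference path hits the same unsupported-dtype error during autotuning of _kernel_quantize_fp8_row, which confirms the environment is the limitation: even the comparison kernel cannot compile. For orientation, a sketch (an assumption, not fbgemm_gpu's code) of what a rowwise fp8 quantization like triton_quantize_fp8_row plausibly computes, emulated in plain PyTorch so it runs without Triton fp8e4nv support; the 448.0 constant is the e4m3 finite max, and the exact clamping may differ:

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None):
        row_max = y.abs().amax(dim=1)                  # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # honor the upper bound
        scale = row_max.clamp(min=1e-12) / 448.0        # e4m3 max finite value
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize via y_fp8.to(torch.float32) * scale[:, None]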
2025-05-07T20:32:14.2080364Z op = torch.compile(op) 2025-05-07T20:32:14.2080481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2080558Z 2025-05-07T20:32:14.2080653Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2080657Z 2025-05-07T20:32:14.2080772Z moe/activation_test.py:117: 2025-05-07T20:32:14.2080904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2081064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2081168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2081549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2081656Z return fn(*args, **kwargs) 2025-05-07T20:32:14.2082162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2082267Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2082643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2082873Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2083238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2083338Z kernel = self.compile( 2025-05-07T20:32:14.2083735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2083923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2084055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2084063Z 2025-05-07T20:32:14.2084283Z self = 2025-05-07T20:32:14.2085085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2085684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf52d0>} 2025-05-07T20:32:14.2086460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2086662Z context = 2025-05-07T20:32:14.2086667Z 2025-05-07T20:32:14.2086847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2087125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2087238Z module_map=module_map) 2025-05-07T20:32:14.2087419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2087524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2087611Z E ^ 2025-05-07T20:32:14.2087980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2087985Z 2025-05-07T20:32:14.2088411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2088418Z 2025-05-07T20:32:14.2088536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2088767Z self=, 2025-05-07T20:32:14.2088847Z T=1, 2025-05-07T20:32:14.2088981Z D=5120, 2025-05-07T20:32:14.2089071Z scale_ub=1200.0, 2025-05-07T20:32:14.2089169Z contiguous=False, 2025-05-07T20:32:14.2089256Z compiled=False, 2025-05-07T20:32:14.2089333Z ) 2025-05-07T20:32:14.2089565Z self = 2025-05-07T20:32:14.2089741Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2089745Z 2025-05-07T20:32:14.2089827Z @given( 2025-05-07T20:32:14.2089964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2090069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2090235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2090363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2090482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2090568Z ) 2025-05-07T20:32:14.2090819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2090919Z def test_silu_mul_quant( 2025-05-07T20:32:14.2091006Z self, 2025-05-07T20:32:14.2091084Z T: int, 2025-05-07T20:32:14.2091163Z D: int, 2025-05-07T20:32:14.2091269Z scale_ub: Optional[float], 2025-05-07T20:32:14.2091360Z contiguous: bool, 2025-05-07T20:32:14.2091447Z compiled: bool, 2025-05-07T20:32:14.2091531Z ) -> None: 2025-05-07T20:32:14.2091629Z torch.manual_seed(2025) 2025-05-07T20:32:14.2091703Z 2025-05-07T20:32:14.2091885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2091959Z 2025-05-07T20:32:14.2092066Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2092196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2092287Z x = x_sign * x_clamp 2025-05-07T20:32:14.2092374Z x0 = x[:, :D] 2025-05-07T20:32:14.2092454Z x1 = x[:, D:] 2025-05-07T20:32:14.2092529Z 2025-05-07T20:32:14.2092619Z if contiguous: 2025-05-07T20:32:14.2092715Z x0 = x0.contiguous() 2025-05-07T20:32:14.2092805Z x1 = x1.contiguous() 2025-05-07T20:32:14.2092886Z 2025-05-07T20:32:14.2092982Z if scale_ub is not None: 2025-05-07T20:32:14.2093089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2093236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2093313Z ) 2025-05-07T20:32:14.2093401Z else: 2025-05-07T20:32:14.2093587Z scale_ub_tensor = None 2025-05-07T20:32:14.2093663Z 2025-05-07T20:32:14.2093803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2093898Z op = silu_mul_quant 2025-05-07T20:32:14.2093985Z if compiled: 2025-05-07T20:32:14.2094093Z op = torch.compile(op) 2025-05-07T20:32:14.2094201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2094275Z 2025-05-07T20:32:14.2094375Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2094382Z 2025-05-07T20:32:14.2094483Z moe/activation_test.py:117: 2025-05-07T20:32:14.2094614Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2094724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2094826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2095345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2095451Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2095819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2096055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2096406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2096507Z kernel = self.compile( 2025-05-07T20:32:14.2096899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2097127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2097260Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2097265Z 2025-05-07T20:32:14.2097478Z self = 2025-05-07T20:32:14.2098474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2099148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf5b40>} 2025-05-07T20:32:14.2099918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2100176Z context = 2025-05-07T20:32:14.2100183Z 2025-05-07T20:32:14.2100415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2100747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2100859Z module_map=module_map) 2025-05-07T20:32:14.2101025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2101157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2101237Z E ^ 2025-05-07T20:32:14.2101604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2101608Z 2025-05-07T20:32:14.2102048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2102055Z 2025-05-07T20:32:14.2107925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2108250Z self=, 2025-05-07T20:32:14.2108369Z T=16384, 2025-05-07T20:32:14.2108479Z D=5120, 2025-05-07T20:32:14.2108603Z scale_ub=1200.0, 2025-05-07T20:32:14.2108692Z contiguous=False, 2025-05-07T20:32:14.2108923Z compiled=True, 2025-05-07T20:32:14.2109003Z ) 2025-05-07T20:32:14.2109230Z self = 2025-05-07T20:32:14.2109423Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2109429Z 2025-05-07T20:32:14.2109508Z @given( 2025-05-07T20:32:14.2109631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2109740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2109909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2110068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2110196Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2110273Z ) 2025-05-07T20:32:14.2110537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2110635Z def test_silu_mul_quant( 2025-05-07T20:32:14.2110714Z self, 2025-05-07T20:32:14.2110804Z T: int, 2025-05-07T20:32:14.2110888Z D: int, 2025-05-07T20:32:14.2111013Z scale_ub: Optional[float], 2025-05-07T20:32:14.2111155Z contiguous: bool, 2025-05-07T20:32:14.2111284Z compiled: bool, 2025-05-07T20:32:14.2111396Z ) -> None: 2025-05-07T20:32:14.2111540Z torch.manual_seed(2025) 2025-05-07T20:32:14.2111647Z 2025-05-07T20:32:14.2111879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2111998Z 2025-05-07T20:32:14.2112126Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2112414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2112539Z x = x_sign * x_clamp 2025-05-07T20:32:14.2112624Z x0 = x[:, :D] 2025-05-07T20:32:14.2112717Z x1 = x[:, D:] 2025-05-07T20:32:14.2112792Z 2025-05-07T20:32:14.2112881Z if contiguous: 2025-05-07T20:32:14.2112989Z x0 = x0.contiguous() 2025-05-07T20:32:14.2113083Z x1 = x1.contiguous() 2025-05-07T20:32:14.2113163Z 2025-05-07T20:32:14.2113264Z if scale_ub is not None: 2025-05-07T20:32:14.2113374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2113574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2113660Z ) 2025-05-07T20:32:14.2113738Z else: 2025-05-07T20:32:14.2113841Z scale_ub_tensor = None 2025-05-07T20:32:14.2113915Z 2025-05-07T20:32:14.2114050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2114152Z op = silu_mul_quant 2025-05-07T20:32:14.2114240Z if compiled: 2025-05-07T20:32:14.2114342Z op = torch.compile(op) 2025-05-07T20:32:14.2114458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2114534Z 2025-05-07T20:32:14.2114627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2114631Z 2025-05-07T20:32:14.2114739Z moe/activation_test.py:117: 2025-05-07T20:32:14.2114874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2114986Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2115089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2115478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2115582Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2116091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2116193Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2116567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2116797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2117159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2117377Z kernel = self.compile( 2025-05-07T20:32:14.2117772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2117969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2118100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2118104Z 2025-05-07T20:32:14.2118316Z self = 2025-05-07T20:32:14.2119130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2119652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf6cb0>} 2025-05-07T20:32:14.2120433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2120634Z context = 2025-05-07T20:32:14.2120639Z 2025-05-07T20:32:14.2120814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2121090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2121247Z module_map=module_map) 2025-05-07T20:32:14.2121422Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2121527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2121607Z E ^ 2025-05-07T20:32:14.2121981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2121986Z 2025-05-07T20:32:14.2122420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2122424Z 2025-05-07T20:32:14.2122637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2122867Z self=, 2025-05-07T20:32:14.2122945Z T=2048, 2025-05-07T20:32:14.2123032Z D=7168, 2025-05-07T20:32:14.2123118Z scale_ub=1200.0, 2025-05-07T20:32:14.2123207Z contiguous=False, 2025-05-07T20:32:14.2123304Z compiled=True, 2025-05-07T20:32:14.2123380Z ) 2025-05-07T20:32:14.2123612Z self = 2025-05-07T20:32:14.2123791Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2123795Z 2025-05-07T20:32:14.2123874Z @given( 2025-05-07T20:32:14.2124004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2124111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2124229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2124358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124558Z ) 2025-05-07T20:32:14.2124810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2124905Z def test_silu_mul_quant( 2025-05-07T20:32:14.2124993Z self, 2025-05-07T20:32:14.2125073Z T: int, 2025-05-07T20:32:14.2125152Z D: int, 2025-05-07T20:32:14.2125260Z scale_ub: Optional[float], 2025-05-07T20:32:14.2125350Z contiguous: bool, 2025-05-07T20:32:14.2125436Z compiled: bool, 2025-05-07T20:32:14.2125524Z ) -> None: 2025-05-07T20:32:14.2125621Z torch.manual_seed(2025) 2025-05-07T20:32:14.2125694Z 2025-05-07T20:32:14.2125867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2126035Z 2025-05-07T20:32:14.2126129Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2126257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2126355Z x = x_sign * x_clamp 2025-05-07T20:32:14.2126439Z x0 = x[:, :D] 2025-05-07T20:32:14.2126524Z x1 = x[:, D:] 2025-05-07T20:32:14.2126596Z 2025-05-07T20:32:14.2126680Z if contiguous: 2025-05-07T20:32:14.2126780Z x0 = x0.contiguous() 2025-05-07T20:32:14.2126868Z x1 = x1.contiguous() 2025-05-07T20:32:14.2126946Z 2025-05-07T20:32:14.2127043Z if scale_ub is not None: 2025-05-07T20:32:14.2127149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2127286Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2127366Z ) 2025-05-07T20:32:14.2127442Z else: 2025-05-07T20:32:14.2127536Z scale_ub_tensor = None 2025-05-07T20:32:14.2127614Z 2025-05-07T20:32:14.2127752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2127848Z op = silu_mul_quant 2025-05-07T20:32:14.2127933Z if compiled: 2025-05-07T20:32:14.2128037Z op = torch.compile(op) 2025-05-07T20:32:14.2128148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2128223Z 2025-05-07T20:32:14.2128319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2128324Z 2025-05-07T20:32:14.2128429Z moe/activation_test.py:117: 2025-05-07T20:32:14.2128559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2128709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2128816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2129194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2129294Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2129805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2129905Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2130277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2130549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2130897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2131001Z kernel = self.compile( 2025-05-07T20:32:14.2131392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2131578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2131704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2131709Z 2025-05-07T20:32:14.2131925Z self = 2025-05-07T20:32:14.2132778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2133306Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebbf7b50>} 2025-05-07T20:32:14.2134082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2134279Z context = 2025-05-07T20:32:14.2134283Z 2025-05-07T20:32:14.2134459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2134812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2134922Z module_map=module_map) 2025-05-07T20:32:14.2135096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2135197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2135273Z E ^ 2025-05-07T20:32:14.2135644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2135652Z 2025-05-07T20:32:14.2136076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2136081Z 2025-05-07T20:32:14.2136193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2136423Z self=, 2025-05-07T20:32:14.2136500Z T=1, 2025-05-07T20:32:14.2136585Z D=5120, 2025-05-07T20:32:14.2136668Z scale_ub=None, 2025-05-07T20:32:14.2136762Z contiguous=False, 2025-05-07T20:32:14.2136858Z compiled=False, 2025-05-07T20:32:14.2136934Z ) 2025-05-07T20:32:14.2137157Z self = 2025-05-07T20:32:14.2137340Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2137345Z 2025-05-07T20:32:14.2137424Z @given( 2025-05-07T20:32:14.2137553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2137655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2137820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2137945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2138236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2138339Z ) 2025-05-07T20:32:14.2138599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2138698Z def test_silu_mul_quant( 2025-05-07T20:32:14.2138788Z self, 2025-05-07T20:32:14.2138867Z T: int, 2025-05-07T20:32:14.2138945Z D: int, 2025-05-07T20:32:14.2139050Z scale_ub: Optional[float], 2025-05-07T20:32:14.2139192Z contiguous: bool, 2025-05-07T20:32:14.2139280Z compiled: bool, 2025-05-07T20:32:14.2139364Z ) -> None: 2025-05-07T20:32:14.2139461Z torch.manual_seed(2025) 2025-05-07T20:32:14.2139535Z 2025-05-07T20:32:14.2139713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2139790Z 2025-05-07T20:32:14.2139884Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2140017Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2140108Z x = x_sign * x_clamp 2025-05-07T20:32:14.2140189Z x0 = x[:, :D] 2025-05-07T20:32:14.2140276Z x1 = x[:, D:] 2025-05-07T20:32:14.2140349Z 2025-05-07T20:32:14.2140440Z if contiguous: 2025-05-07T20:32:14.2140533Z x0 = x0.contiguous() 2025-05-07T20:32:14.2140628Z x1 = x1.contiguous() 2025-05-07T20:32:14.2140709Z 2025-05-07T20:32:14.2140802Z if scale_ub is not None: 2025-05-07T20:32:14.2140913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2141056Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2141134Z ) 2025-05-07T20:32:14.2141212Z else: 2025-05-07T20:32:14.2141314Z scale_ub_tensor = None 2025-05-07T20:32:14.2141391Z 2025-05-07T20:32:14.2141524Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2141626Z op = silu_mul_quant 2025-05-07T20:32:14.2141712Z if compiled: 2025-05-07T20:32:14.2141819Z op = torch.compile(op) 2025-05-07T20:32:14.2141931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2142004Z 2025-05-07T20:32:14.2142103Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2142107Z 2025-05-07T20:32:14.2142207Z moe/activation_test.py:117: 2025-05-07T20:32:14.2142433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2142546Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2142652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2143166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2143277Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2143646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2143885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2144236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2144333Z kernel = self.compile( 2025-05-07T20:32:14.2144742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2144921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2145054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2145060Z 2025-05-07T20:32:14.2145272Z self = 2025-05-07T20:32:14.2146073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2146643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe85e0>} 2025-05-07T20:32:14.2147414Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2147619Z context = 2025-05-07T20:32:14.2147699Z 2025-05-07T20:32:14.2147872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2148148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2148265Z module_map=module_map) 2025-05-07T20:32:14.2148433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2148544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2148624Z E ^ 2025-05-07T20:32:14.2148987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2148991Z 2025-05-07T20:32:14.2149423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2149432Z 2025-05-07T20:32:14.2149540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2149776Z self=, 2025-05-07T20:32:14.2149859Z T=4096, 2025-05-07T20:32:14.2149936Z D=7168, 2025-05-07T20:32:14.2150027Z scale_ub=1200.0, 2025-05-07T20:32:14.2150120Z contiguous=False, 2025-05-07T20:32:14.2150205Z compiled=False, 2025-05-07T20:32:14.2150287Z ) 2025-05-07T20:32:14.2150511Z self = 2025-05-07T20:32:14.2150695Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2150700Z 2025-05-07T20:32:14.2150784Z @given( 2025-05-07T20:32:14.2150910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2151018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2151138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2151348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2151473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2151549Z ) 2025-05-07T20:32:14.2151804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2151905Z def test_silu_mul_quant( 2025-05-07T20:32:14.2151984Z self, 2025-05-07T20:32:14.2152062Z T: int, 2025-05-07T20:32:14.2152147Z D: int, 2025-05-07T20:32:14.2152254Z scale_ub: Optional[float], 2025-05-07T20:32:14.2152367Z contiguous: bool, 2025-05-07T20:32:14.2152466Z compiled: bool, 2025-05-07T20:32:14.2152564Z ) -> None: 2025-05-07T20:32:14.2152674Z torch.manual_seed(2025) 2025-05-07T20:32:14.2152747Z 2025-05-07T20:32:14.2152921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2153003Z 2025-05-07T20:32:14.2153097Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2153230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2153326Z x = x_sign * x_clamp 2025-05-07T20:32:14.2153407Z x0 = x[:, :D] 2025-05-07T20:32:14.2153490Z x1 = x[:, D:] 2025-05-07T20:32:14.2153572Z 2025-05-07T20:32:14.2153656Z if contiguous: 2025-05-07T20:32:14.2153749Z x0 = x0.contiguous() 2025-05-07T20:32:14.2153845Z x1 = x1.contiguous() 2025-05-07T20:32:14.2153918Z 2025-05-07T20:32:14.2154010Z if scale_ub is not None: 2025-05-07T20:32:14.2154123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2154309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2154394Z ) 2025-05-07T20:32:14.2154471Z else: 2025-05-07T20:32:14.2154567Z scale_ub_tensor = None 2025-05-07T20:32:14.2154646Z 2025-05-07T20:32:14.2154778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2154868Z op = silu_mul_quant 2025-05-07T20:32:14.2154969Z if compiled: 2025-05-07T20:32:14.2155071Z op = torch.compile(op) 2025-05-07T20:32:14.2155179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2155302Z 2025-05-07T20:32:14.2155395Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2155400Z 2025-05-07T20:32:14.2155506Z moe/activation_test.py:117: 2025-05-07T20:32:14.2156017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2156128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2156242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2156759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.2156859Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2157234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2157467Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2157824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2157923Z kernel = self.compile( 2025-05-07T20:32:14.2158314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2158498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2158623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2158631Z 2025-05-07T20:32:14.2158841Z self = 2025-05-07T20:32:14.2159646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2160401Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaebfe8ca0>} 2025-05-07T20:32:14.2161184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2161382Z context = 2025-05-07T20:32:14.2161388Z 2025-05-07T20:32:14.2161562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2161834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2161942Z module_map=module_map) 2025-05-07T20:32:14.2162117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2162222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2162306Z E ^ 2025-05-07T20:32:14.2162676Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis, running at verbosity=Verbosity.verbose, keeps drawing new parameter combinations, and every draw replays the identical traceback: moe/activation_test.py:117 into fn, the _fbgemm_silu_mul_quant[grid] launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, then triton/runtime/jit.py and triton/compiler/compiler.py, ending in the same CompilationError.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
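The test body shown above is straightforward: it seeds the RNG, builds a [T, 2 * D] bfloat16 tensor whose magnitudes are clamped to [0.01, 2.0], and splits it into halves x0 and x1 before calling silu_mul_quant. The math the fused kernel is being asked to produce, as the name suggests, is a SiLU gate on the first half multiplied elementwise by the second half. A plain-PyTorch sketch of that unquantized part (the semantics and the float32 upcast are assumptions about the kernel's internals, inferred from the test, not taken from fbgemm source):

    import torch
    import torch.nn.functional as F

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x0) * x1, upcast from bfloat16 for a higher-precision reference.
        return F.silu(x0.float()) * x1.float()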
The next draws fail the same way, with the compiled flag making no difference to the outcome:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False): CompilationError, type fp8e4nv not supported in this architecture
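Note how scale_ub is handled in the test: when drawn as 1200.0 it is wrapped in a one-element float32 CUDA tensor, and when drawn as None no bound is passed at all. An upper bound like this typically caps the per-row amax before the FP8 scale is derived, so a single outlier row cannot stretch the quantization range. A sketch of that scheme for row-wise E4M3 quantization, returning the (y_fp8, y_scale) pair the test unpacks (the exact scaling recipe is an assumption, not the kernel's verified behavior):

    from typing import Optional

    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def quantize_fp8_rowwise(y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None):
        # Per-row absolute max, optionally clamped by the upper bound.
        amax = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)
        y_scale = torch.clamp(amax, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.float() / y_scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale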
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2250114Z 2025-05-07T20:32:14.2250539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2250544Z 2025-05-07T20:32:14.2250765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2251009Z self=, 2025-05-07T20:32:14.2251095Z T=16384, 2025-05-07T20:32:14.2251182Z D=7168, 2025-05-07T20:32:14.2251271Z scale_ub=1200.0, 2025-05-07T20:32:14.2251363Z contiguous=False, 2025-05-07T20:32:14.2251460Z compiled=True, 2025-05-07T20:32:14.2251538Z ) 2025-05-07T20:32:14.2251768Z self = 2025-05-07T20:32:14.2251963Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2251967Z 2025-05-07T20:32:14.2252049Z @given( 2025-05-07T20:32:14.2252177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2252298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2252429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2252576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2252728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2252808Z ) 2025-05-07T20:32:14.2253073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2253174Z def test_silu_mul_quant( 2025-05-07T20:32:14.2253257Z self, 2025-05-07T20:32:14.2253345Z T: int, 2025-05-07T20:32:14.2253425Z D: int, 2025-05-07T20:32:14.2253529Z scale_ub: Optional[float], 2025-05-07T20:32:14.2253629Z contiguous: bool, 2025-05-07T20:32:14.2253765Z compiled: bool, 2025-05-07T20:32:14.2253847Z ) -> None: 2025-05-07T20:32:14.2253955Z torch.manual_seed(2025) 2025-05-07T20:32:14.2254030Z 2025-05-07T20:32:14.2254215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2254291Z 2025-05-07T20:32:14.2254387Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2254524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2254623Z x = x_sign * x_clamp 2025-05-07T20:32:14.2254707Z x0 = x[:, :D] 2025-05-07T20:32:14.2254798Z x1 = x[:, D:] 2025-05-07T20:32:14.2254916Z 2025-05-07T20:32:14.2255004Z if contiguous: 2025-05-07T20:32:14.2255108Z x0 = x0.contiguous() 2025-05-07T20:32:14.2255202Z x1 = x1.contiguous() 2025-05-07T20:32:14.2255276Z 2025-05-07T20:32:14.2255379Z if scale_ub is not None: 2025-05-07T20:32:14.2255489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2256025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2256132Z ) 2025-05-07T20:32:14.2256215Z else: 2025-05-07T20:32:14.2256314Z scale_ub_tensor = None 2025-05-07T20:32:14.2256397Z 2025-05-07T20:32:14.2256532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2256627Z op = silu_mul_quant 2025-05-07T20:32:14.2256718Z if compiled: 2025-05-07T20:32:14.2256828Z op = torch.compile(op) 2025-05-07T20:32:14.2256939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2257022Z 2025-05-07T20:32:14.2257119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2257124Z 2025-05-07T20:32:14.2257226Z moe/activation_test.py:117: 2025-05-07T20:32:14.2257366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2257471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2257580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2257963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2258161Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2258677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2258779Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2259379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2259620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2259977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2260082Z kernel = self.compile( 2025-05-07T20:32:14.2260478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2260662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2260802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2260806Z 2025-05-07T20:32:14.2261022Z self = 2025-05-07T20:32:14.2261837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2262360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c1bd0>} 2025-05-07T20:32:14.2263190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2263453Z context = 2025-05-07T20:32:14.2263458Z 2025-05-07T20:32:14.2263629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2263910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2264021Z module_map=module_map) 2025-05-07T20:32:14.2264197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2264310Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2264391Z E ^ 2025-05-07T20:32:14.2264834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2264839Z 2025-05-07T20:32:14.2265266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2265274Z 2025-05-07T20:32:14.2265386Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2265624Z self=, 2025-05-07T20:32:14.2265706Z T=1, 2025-05-07T20:32:14.2265785Z D=7168, 2025-05-07T20:32:14.2265878Z scale_ub=None, 2025-05-07T20:32:14.2265969Z contiguous=False, 2025-05-07T20:32:14.2266067Z compiled=False, 2025-05-07T20:32:14.2266142Z ) 2025-05-07T20:32:14.2266376Z self = 2025-05-07T20:32:14.2266562Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:14.2266570Z 2025-05-07T20:32:14.2266649Z @given( 2025-05-07T20:32:14.2266773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2266884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2267004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2267125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2267257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2267333Z ) 2025-05-07T20:32:14.2267595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2267692Z def test_silu_mul_quant( 2025-05-07T20:32:14.2267771Z self, 2025-05-07T20:32:14.2267860Z T: int, 2025-05-07T20:32:14.2267939Z D: int, 2025-05-07T20:32:14.2268042Z scale_ub: Optional[float], 2025-05-07T20:32:14.2268226Z contiguous: bool, 2025-05-07T20:32:14.2268318Z compiled: bool, 2025-05-07T20:32:14.2268401Z ) -> None: 2025-05-07T20:32:14.2268509Z torch.manual_seed(2025) 2025-05-07T20:32:14.2268583Z 2025-05-07T20:32:14.2268759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2268842Z 2025-05-07T20:32:14.2268938Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2269074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2269168Z x = x_sign * x_clamp 2025-05-07T20:32:14.2269250Z x0 = x[:, :D] 2025-05-07T20:32:14.2269338Z x1 = x[:, D:] 2025-05-07T20:32:14.2269413Z 2025-05-07T20:32:14.2269498Z if contiguous: 2025-05-07T20:32:14.2269602Z x0 = x0.contiguous() 2025-05-07T20:32:14.2269694Z x1 = x1.contiguous() 2025-05-07T20:32:14.2269769Z 2025-05-07T20:32:14.2269869Z if scale_ub is not None: 2025-05-07T20:32:14.2269986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2270126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2270210Z ) 2025-05-07T20:32:14.2270290Z else: 2025-05-07T20:32:14.2270393Z scale_ub_tensor = None 2025-05-07T20:32:14.2270469Z 2025-05-07T20:32:14.2270604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2270702Z op = silu_mul_quant 2025-05-07T20:32:14.2270790Z if compiled: 2025-05-07T20:32:14.2270941Z op = torch.compile(op) 2025-05-07T20:32:14.2271056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2271132Z 2025-05-07T20:32:14.2271227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2271231Z 2025-05-07T20:32:14.2271336Z moe/activation_test.py:117: 2025-05-07T20:32:14.2271468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2271579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2271690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2272204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2272356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2272729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2272958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2273321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2273418Z kernel = self.compile( 2025-05-07T20:32:14.2273818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2274000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2274133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2274138Z 2025-05-07T20:32:14.2274357Z self = 2025-05-07T20:32:14.2275162Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2275689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c2050>} 2025-05-07T20:32:14.2276459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2276658Z context = 2025-05-07T20:32:14.2276740Z 2025-05-07T20:32:14.2276920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2277195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2277315Z module_map=module_map) 2025-05-07T20:32:14.2277485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2277593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2277677Z E ^ 2025-05-07T20:32:14.2278046Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2278050Z 2025-05-07T20:32:14.2278485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2278490Z 2025-05-07T20:32:14.2278599Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2278837Z self=, 2025-05-07T20:32:14.2278923Z T=2048, 2025-05-07T20:32:14.2279002Z D=7168, 2025-05-07T20:32:14.2279088Z scale_ub=None, 2025-05-07T20:32:14.2279187Z contiguous=False, 2025-05-07T20:32:14.2279274Z compiled=True, 2025-05-07T20:32:14.2279350Z ) 2025-05-07T20:32:14.2279582Z self = 2025-05-07T20:32:14.2279762Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2279766Z 2025-05-07T20:32:14.2279901Z @given( 2025-05-07T20:32:14.2280023Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2280127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2280254Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2280376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2280494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2280581Z ) 2025-05-07T20:32:14.2280840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2280938Z def test_silu_mul_quant( 2025-05-07T20:32:14.2281106Z self, 2025-05-07T20:32:14.2281186Z T: int, 2025-05-07T20:32:14.2281274Z D: int, 2025-05-07T20:32:14.2281377Z scale_ub: Optional[float], 2025-05-07T20:32:14.2281471Z contiguous: bool, 2025-05-07T20:32:14.2281565Z compiled: bool, 2025-05-07T20:32:14.2281647Z ) -> None: 2025-05-07T20:32:14.2281746Z torch.manual_seed(2025) 2025-05-07T20:32:14.2281832Z 2025-05-07T20:32:14.2282009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2282086Z 2025-05-07T20:32:14.2282191Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2282325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2282430Z x = x_sign * x_clamp 2025-05-07T20:32:14.2282531Z x0 = x[:, :D] 2025-05-07T20:32:14.2282630Z x1 = x[:, D:] 2025-05-07T20:32:14.2282717Z 2025-05-07T20:32:14.2282812Z if contiguous: 2025-05-07T20:32:14.2282907Z x0 = x0.contiguous() 2025-05-07T20:32:14.2283007Z x1 = x1.contiguous() 2025-05-07T20:32:14.2283082Z 2025-05-07T20:32:14.2283179Z if scale_ub is not None: 2025-05-07T20:32:14.2283294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2283434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2283516Z ) 2025-05-07T20:32:14.2283604Z else: 2025-05-07T20:32:14.2283701Z scale_ub_tensor = None 2025-05-07T20:32:14.2283776Z 2025-05-07T20:32:14.2283917Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2284009Z op = silu_mul_quant 2025-05-07T20:32:14.2284097Z if compiled: 2025-05-07T20:32:14.2284205Z op = torch.compile(op) 2025-05-07T20:32:14.2284314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2284488Z 2025-05-07T20:32:14.2284584Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2284589Z 2025-05-07T20:32:14.2284690Z moe/activation_test.py:117: 2025-05-07T20:32:14.2284831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2284934Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2285037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2285424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2285523Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2286040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2286141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2286508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2286754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2287104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2287204Z kernel = self.compile( 2025-05-07T20:32:14.2287603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2287784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2287919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2287967Z 2025-05-07T20:32:14.2288181Z self = 2025-05-07T20:32:14.2288986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2289523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb9c31c0>} 2025-05-07T20:32:14.2290333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2290538Z context = 2025-05-07T20:32:14.2290545Z 2025-05-07T20:32:14.2290717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2290995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2291105Z module_map=module_map) 2025-05-07T20:32:14.2291270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2291379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2291460Z E ^ 2025-05-07T20:32:14.2291825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2291833Z 2025-05-07T20:32:14.2292264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2292268Z 2025-05-07T20:32:14.2292374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2292618Z self=, 2025-05-07T20:32:14.2292719Z T=4096, 2025-05-07T20:32:14.2292803Z D=7168, 2025-05-07T20:32:14.2292913Z scale_ub=None, 2025-05-07T20:32:14.2293001Z contiguous=False, 2025-05-07T20:32:14.2293088Z compiled=True, 2025-05-07T20:32:14.2293167Z ) 2025-05-07T20:32:14.2293393Z self = 2025-05-07T20:32:14.2293651Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2293664Z 2025-05-07T20:32:14.2293744Z @given( 2025-05-07T20:32:14.2293866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2293978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2294096Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2294216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2294341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2294419Z ) 2025-05-07T20:32:14.2294674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2294780Z def test_silu_mul_quant( 2025-05-07T20:32:14.2294858Z self, 2025-05-07T20:32:14.2294937Z T: int, 2025-05-07T20:32:14.2295020Z D: int, 2025-05-07T20:32:14.2295121Z scale_ub: Optional[float], 2025-05-07T20:32:14.2295219Z contiguous: bool, 2025-05-07T20:32:14.2295307Z compiled: bool, 2025-05-07T20:32:14.2295390Z ) -> None: 2025-05-07T20:32:14.2295494Z torch.manual_seed(2025) 2025-05-07T20:32:14.2295568Z 2025-05-07T20:32:14.2295742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2295826Z 2025-05-07T20:32:14.2295922Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2296050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2296146Z x = x_sign * x_clamp 2025-05-07T20:32:14.2296227Z x0 = x[:, :D] 2025-05-07T20:32:14.2296309Z x1 = x[:, D:] 2025-05-07T20:32:14.2296435Z 2025-05-07T20:32:14.2296520Z if contiguous: 2025-05-07T20:32:14.2296613Z x0 = x0.contiguous() 2025-05-07T20:32:14.2296708Z x1 = x1.contiguous() 2025-05-07T20:32:14.2296781Z 2025-05-07T20:32:14.2296879Z if scale_ub is not None: 2025-05-07T20:32:14.2296986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2297125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2297213Z ) 2025-05-07T20:32:14.2297290Z else: 2025-05-07T20:32:14.2297386Z scale_ub_tensor = None 2025-05-07T20:32:14.2297510Z 2025-05-07T20:32:14.2297642Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2297735Z op = silu_mul_quant 2025-05-07T20:32:14.2297826Z if compiled: 2025-05-07T20:32:14.2297928Z op = torch.compile(op) 2025-05-07T20:32:14.2298177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2298287Z 2025-05-07T20:32:14.2298389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2298394Z 2025-05-07T20:32:14.2298501Z moe/activation_test.py:117: 2025-05-07T20:32:14.2298630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2298735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2298843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2299231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2299326Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2299861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2299965Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2300342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2300575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2300939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2301036Z kernel = self.compile( 2025-05-07T20:32:14.2301432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2301709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2301841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2301848Z 2025-05-07T20:32:14.2302072Z self = 2025-05-07T20:32:14.2302882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2303409Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb6bc1f0>} 2025-05-07T20:32:14.2304193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2304403Z context = 2025-05-07T20:32:14.2304408Z 2025-05-07T20:32:14.2304590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2304868Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2304981Z module_map=module_map) 2025-05-07T20:32:14.2305156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2305259Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2305388Z E ^ 2025-05-07T20:32:14.2305835Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
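[Editorial note: the test drives silu_mul_quant from fbgemm_gpu.experimental.gen_ai.moe over bf16 activations split into gate (x0) and up (x1) halves. As a reading aid only, here is a rough eager-mode sketch of what the op plausibly computes, inferred from the test's inputs and outputs rather than from FBGEMM's actual kernel:]

# Eager-mode sketch (assumption inferred from the test, NOT FBGEMM's
# implementation): y = silu(x0) * x1, quantized rowwise to FP8 with an
# optional upper bound applied to the per-row amax.
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)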
[Hypothesis retried eight further examples; each reran the identical test body above and failed at the same make_ir step with the same CompilationError from compiler.py:100, so the repeated source listings and tracebacks are elided. Runs with compiled=True add one extra torch/_dynamo/eval_frame.py:678 frame; the tracebacks are otherwise identical.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)     -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)    -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> same CompilationError
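[Editorial note: every combination fails the same capability check before the kernel body is ever lowered, so a device-capability guard would silence the whole parameter matrix on unsupported runners. A sketch only; the helper and class names are illustrative, not FBGEMM's actual guard:]

# Sketch of a capability guard (illustrative, not FBGEMM's actual code):
# skip FP8 tests on GPUs older than SM 8.9, where Triton's fp8e4nv
# (torch.float8_e4m3fn) type is unavailable.
import unittest
import torch

def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # fp8e4nv needs compute capability >= 8.9 (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _supports_fp8e4nv(), "FP8 (fp8e4nv) requires SM 8.9+")
class ActivationFP8Tests(unittest.TestCase):
    ...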
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2437169Z 2025-05-07T20:32:14.2437598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2437603Z 2025-05-07T20:32:14.2437718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2437948Z self=, 2025-05-07T20:32:14.2438040Z T=1, 2025-05-07T20:32:14.2438124Z D=7168, 2025-05-07T20:32:14.2438216Z scale_ub=1200.0, 2025-05-07T20:32:14.2438314Z contiguous=False, 2025-05-07T20:32:14.2438402Z compiled=False, 2025-05-07T20:32:14.2438477Z ) 2025-05-07T20:32:14.2438707Z self = 2025-05-07T20:32:14.2438884Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2438889Z 2025-05-07T20:32:14.2438975Z @given( 2025-05-07T20:32:14.2439105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2439210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2439335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2439465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2439584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2439666Z ) 2025-05-07T20:32:14.2439917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2440060Z def test_silu_mul_quant( 2025-05-07T20:32:14.2440145Z self, 2025-05-07T20:32:14.2440224Z T: int, 2025-05-07T20:32:14.2440302Z D: int, 2025-05-07T20:32:14.2440409Z scale_ub: Optional[float], 2025-05-07T20:32:14.2440503Z contiguous: bool, 2025-05-07T20:32:14.2440592Z compiled: bool, 2025-05-07T20:32:14.2440678Z ) -> None: 2025-05-07T20:32:14.2440782Z torch.manual_seed(2025) 2025-05-07T20:32:14.2440856Z 2025-05-07T20:32:14.2441035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2441157Z 2025-05-07T20:32:14.2441258Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2441386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2441476Z x = x_sign * x_clamp 2025-05-07T20:32:14.2441563Z x0 = x[:, :D] 2025-05-07T20:32:14.2441644Z x1 = x[:, D:] 2025-05-07T20:32:14.2441719Z 2025-05-07T20:32:14.2441809Z if contiguous: 2025-05-07T20:32:14.2441901Z x0 = x0.contiguous() 2025-05-07T20:32:14.2441992Z x1 = x1.contiguous() 2025-05-07T20:32:14.2442071Z 2025-05-07T20:32:14.2442164Z if scale_ub is not None: 2025-05-07T20:32:14.2442272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2442420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2442495Z ) 2025-05-07T20:32:14.2442583Z else: 2025-05-07T20:32:14.2442680Z scale_ub_tensor = None 2025-05-07T20:32:14.2442754Z 2025-05-07T20:32:14.2442894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2442989Z op = silu_mul_quant 2025-05-07T20:32:14.2443075Z if compiled: 2025-05-07T20:32:14.2443185Z op = torch.compile(op) 2025-05-07T20:32:14.2443294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2443372Z 2025-05-07T20:32:14.2443479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2443483Z 2025-05-07T20:32:14.2443583Z moe/activation_test.py:117: 2025-05-07T20:32:14.2443723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2443826Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2443929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2444534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2444637Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2445005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2445241Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2445593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2445696Z kernel = self.compile( 2025-05-07T20:32:14.2446090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2446270Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2446404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2446408Z 2025-05-07T20:32:14.2446628Z self = 2025-05-07T20:32:14.2447433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2447954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a2680>} 2025-05-07T20:32:14.2448721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2449006Z context = 2025-05-07T20:32:14.2449011Z 2025-05-07T20:32:14.2449182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2449468Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2449581Z module_map=module_map) 2025-05-07T20:32:14.2449790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2449900Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2449980Z E ^ 2025-05-07T20:32:14.2450345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2450357Z 2025-05-07T20:32:14.2450784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2450789Z 2025-05-07T20:32:14.2450896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2451134Z self=, 2025-05-07T20:32:14.2451215Z T=4096, 2025-05-07T20:32:14.2451294Z D=7168, 2025-05-07T20:32:14.2451387Z scale_ub=1200.0, 2025-05-07T20:32:14.2451482Z contiguous=False, 2025-05-07T20:32:14.2451569Z compiled=True, 2025-05-07T20:32:14.2451651Z ) 2025-05-07T20:32:14.2451875Z self = 2025-05-07T20:32:14.2452066Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.2452071Z 2025-05-07T20:32:14.2452152Z @given( 2025-05-07T20:32:14.2452275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2452388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2452511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2452633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2452758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2452834Z ) 2025-05-07T20:32:14.2453086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2453189Z def test_silu_mul_quant( 2025-05-07T20:32:14.2453349Z self, 2025-05-07T20:32:14.2453435Z T: int, 2025-05-07T20:32:14.2453512Z D: int, 2025-05-07T20:32:14.2453614Z scale_ub: Optional[float], 2025-05-07T20:32:14.2453715Z contiguous: bool, 2025-05-07T20:32:14.2453804Z compiled: bool, 2025-05-07T20:32:14.2453883Z ) -> None: 2025-05-07T20:32:14.2453988Z torch.manual_seed(2025) 2025-05-07T20:32:14.2454061Z 2025-05-07T20:32:14.2454234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2454318Z 2025-05-07T20:32:14.2454414Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2454544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2454643Z x = x_sign * x_clamp 2025-05-07T20:32:14.2454723Z x0 = x[:, :D] 2025-05-07T20:32:14.2454811Z x1 = x[:, D:] 2025-05-07T20:32:14.2454885Z 2025-05-07T20:32:14.2454969Z if contiguous: 2025-05-07T20:32:14.2455068Z x0 = x0.contiguous() 2025-05-07T20:32:14.2455162Z x1 = x1.contiguous() 2025-05-07T20:32:14.2455237Z 2025-05-07T20:32:14.2455337Z if scale_ub is not None: 2025-05-07T20:32:14.2455445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2455911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2456019Z ) 2025-05-07T20:32:14.2456097Z else: 2025-05-07T20:32:14.2456193Z scale_ub_tensor = None 2025-05-07T20:32:14.2456272Z 2025-05-07T20:32:14.2456407Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2456671Z op = silu_mul_quant 2025-05-07T20:32:14.2456767Z if compiled: 2025-05-07T20:32:14.2456868Z op = torch.compile(op) 2025-05-07T20:32:14.2456984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2457056Z 2025-05-07T20:32:14.2457148Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2457153Z 2025-05-07T20:32:14.2457257Z moe/activation_test.py:117: 2025-05-07T20:32:14.2457394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2457500Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2457697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2458174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2458277Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2458784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2458890Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2459263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2459492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2459847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2459950Z kernel = self.compile( 2025-05-07T20:32:14.2460341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2460529Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2460656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2460660Z 2025-05-07T20:32:14.2460875Z self = 2025-05-07T20:32:14.2461680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2462390Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaeb2a3b50>} 2025-05-07T20:32:14.2463172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2463373Z context = 2025-05-07T20:32:14.2463377Z 2025-05-07T20:32:14.2463552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2463823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2463937Z module_map=module_map) 2025-05-07T20:32:14.2464107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2464208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2464287Z E ^ 2025-05-07T20:32:14.2464662Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:14.2465212Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:14.2477717Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2478705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
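For readers without the FBGEMM source at hand, the following is roughly what the op under test computes. It is inferred from the test alone (the name silu_mul_quant, the op(x0, x1, scale_ub_tensor) call, and the (y_fp8, y_scale) return), so treat it as a sketch of the contract under those assumptions, not FBGEMM's actual kernel:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, then rowwise dynamic quantization to FP8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            # scale_ub caps the rowwise max, matching the [1200.0] tensor above.
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale

Under these assumptions, the fp8e4nv requirement in the Triton kernel corresponds to the final cast to torch.float8_e4m3fn.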
2025-05-07T20:32:14.2478825Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:14.2491390Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2492388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2492560Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2496256Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2498275Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2498409Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2498533Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2508042Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2509919Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2510066Z moe/activation_test.py:95: OutOfMemoryError
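The OutOfMemoryError sizes are not arbitrary: each failing line (torch.randn, torch.abs/torch.clamp, torch.sign) materializes one [T, 2*D] bfloat16 tensor, and at 2 bytes per element the reported "Tried to allocate" values follow directly. A quick check against the shapes seen in this log:

    # bf16 is 2 bytes/element; each op above allocates one [T, 2*D] tensor.
    for T, D in [(16384, 5120), (16384, 7168), (4096, 7168), (2048, 7168)]:
        print(f"T={T:>5}, D={D}: {T * 2 * D * 2 / 2**20:7.2f} MiB")
    # -> 320.00, 448.00, 112.00, 56.00 MiB: the exact sizes in the OOM messages.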
2025-05-07T20:32:14.2510179Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2516770Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2518715Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2518925Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2519069Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2523240Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2525243Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2525393Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2525513Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2529264Z >       x_sign = torch.sign(x)
2025-05-07T20:32:14.2531124Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2531271Z moe/activation_test.py:94: OutOfMemoryError
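Note how the "free" figure shrinks across examples (140.44 MiB, then 28.44, then 26.44) while PyTorch's allocated total creeps up from 21.50 toward 21.73 GiB: tensors from earlier Hypothesis examples are still alive when the next one starts. Two plausible mitigations, the first taken from the allocator hint in the messages themselves (the helper name is illustrative, not from the test file):

    import gc
    import os
    import torch

    # 1) Honor the hint in the OOM message. This must be set before the first
    #    CUDA allocation (e.g., in the CI job's environment) to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    def release_cuda_memory() -> None:
        # 2) Between examples: drop dead references, then hand cached blocks
        #    back so the next example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()

Because Hypothesis runs many examples inside a single test call, per-example cleanup like release_cuda_memory() would have to run at the top of the test body rather than in a tearDown().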
2025-05-07T20:32:14.2531384Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2543737Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2544778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2544899Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2557429Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2558551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2558674Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2571026Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2572063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2572199Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2575580Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2577447Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2577587Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2577703Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2590153Z E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2591155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2591269Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2594879Z >       x_sign = torch.sign(x)
2025-05-07T20:32:14.2596734Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2596914Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:14.2597026Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2600336Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2602243Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2602402Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2602532Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:14.2605959Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2607798Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2607937Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2608051Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2611367Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2613314Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:14.2613485Z moe/activation_test.py:92: OutOfMemoryError
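By this point even the smallest [T, 2*D] allocations (40 MiB) fail, so the ~21.7 GiB baseline almost certainly comes from earlier work in the same process rather than this test's own tensors. A hypothetical instrumentation hook, not present in the test file, that would confirm this between examples:

    import torch

    def log_allocator_state(tag: str) -> None:
        # Compare against the OOM messages: 'allocated' should mirror the
        # 'allocated by PyTorch' figure, 'reserved' the cached total.
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")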
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2613360Z 2025-05-07T20:32:14.2613485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2613489Z 2025-05-07T20:32:14.2613605Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2613835Z self=, 2025-05-07T20:32:14.2613917Z T=4096, 2025-05-07T20:32:14.2614003Z D=7168, 2025-05-07T20:32:14.2614093Z scale_ub=None, 2025-05-07T20:32:14.2614184Z contiguous=True, 2025-05-07T20:32:14.2614279Z compiled=True, 2025-05-07T20:32:14.2614356Z ) 2025-05-07T20:32:14.2614624Z self = 2025-05-07T20:32:14.2614811Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.2614815Z 2025-05-07T20:32:14.2614896Z @given( 2025-05-07T20:32:14.2615024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2615131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2615250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2615377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2615500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2615577Z ) 2025-05-07T20:32:14.2615836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2615935Z def test_silu_mul_quant( 2025-05-07T20:32:14.2616024Z self, 2025-05-07T20:32:14.2616104Z T: int, 2025-05-07T20:32:14.2616187Z D: int, 2025-05-07T20:32:14.2616293Z scale_ub: Optional[float], 2025-05-07T20:32:14.2616390Z contiguous: bool, 2025-05-07T20:32:14.2616480Z compiled: bool, 2025-05-07T20:32:14.2616568Z ) -> None: 2025-05-07T20:32:14.2616667Z torch.manual_seed(2025) 2025-05-07T20:32:14.2616744Z 2025-05-07T20:32:14.2616923Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2618982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2618989Z 2025-05-07T20:32:14.2619126Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2619131Z 2025-05-07T20:32:14.2619241Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2619479Z self=, 2025-05-07T20:32:14.2619560Z T=2048, 2025-05-07T20:32:14.2619640Z D=5120, 2025-05-07T20:32:14.2619737Z scale_ub=1200.0, 2025-05-07T20:32:14.2619827Z contiguous=False, 2025-05-07T20:32:14.2619913Z compiled=False, 2025-05-07T20:32:14.2619998Z ) 2025-05-07T20:32:14.2620225Z self = 2025-05-07T20:32:14.2620408Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.2620413Z 2025-05-07T20:32:14.2620499Z @given( 2025-05-07T20:32:14.2620625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2620730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2620857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2620991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2621164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2621245Z ) 2025-05-07T20:32:14.2621500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2621605Z def test_silu_mul_quant( 2025-05-07T20:32:14.2621745Z self, 2025-05-07T20:32:14.2621853Z T: int, 2025-05-07T20:32:14.2621973Z D: int, 2025-05-07T20:32:14.2622110Z scale_ub: Optional[float], 2025-05-07T20:32:14.2622230Z contiguous: bool, 2025-05-07T20:32:14.2622331Z compiled: bool, 2025-05-07T20:32:14.2622418Z ) -> None: 2025-05-07T20:32:14.2622518Z torch.manual_seed(2025) 2025-05-07T20:32:14.2622603Z 2025-05-07T20:32:14.2622799Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2624766Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2624838Z 2025-05-07T20:32:14.2624965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2624969Z 2025-05-07T20:32:14.2625086Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2625320Z self=, 2025-05-07T20:32:14.2625403Z T=4096, 2025-05-07T20:32:14.2625505Z D=7168, 2025-05-07T20:32:14.2625627Z scale_ub=1200.0, 2025-05-07T20:32:14.2625753Z contiguous=True, 2025-05-07T20:32:14.2625859Z compiled=False, 2025-05-07T20:32:14.2625935Z ) 2025-05-07T20:32:14.2626164Z self = 2025-05-07T20:32:14.2626373Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2626380Z 2025-05-07T20:32:14.2626492Z @given( 2025-05-07T20:32:14.2626646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2626752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2626873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2627001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2627122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2627201Z ) 2025-05-07T20:32:14.2627524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2627624Z def test_silu_mul_quant( 2025-05-07T20:32:14.2627709Z self, 2025-05-07T20:32:14.2627790Z T: int, 2025-05-07T20:32:14.2627872Z D: int, 2025-05-07T20:32:14.2627980Z scale_ub: Optional[float], 2025-05-07T20:32:14.2628074Z contiguous: bool, 2025-05-07T20:32:14.2628164Z compiled: bool, 2025-05-07T20:32:14.2628251Z ) -> None: 2025-05-07T20:32:14.2628351Z torch.manual_seed(2025) 2025-05-07T20:32:14.2628426Z 2025-05-07T20:32:14.2628610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2630601Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2630610Z 2025-05-07T20:32:14.2630792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2630798Z 2025-05-07T20:32:14.2630909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2631148Z self=, 2025-05-07T20:32:14.2631270Z T=16384, 2025-05-07T20:32:14.2631352Z D=7168, 2025-05-07T20:32:14.2631444Z scale_ub=None, 2025-05-07T20:32:14.2631536Z contiguous=False, 2025-05-07T20:32:14.2631625Z compiled=True, 2025-05-07T20:32:14.2631709Z ) 2025-05-07T20:32:14.2631936Z self = 2025-05-07T20:32:14.2632121Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2632125Z 2025-05-07T20:32:14.2632217Z @given( 2025-05-07T20:32:14.2632343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2632456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2632620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2632749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2632882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2632961Z ) 2025-05-07T20:32:14.2633216Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2633327Z def test_silu_mul_quant( 2025-05-07T20:32:14.2633407Z self, 2025-05-07T20:32:14.2633488Z T: int, 2025-05-07T20:32:14.2633577Z D: int, 2025-05-07T20:32:14.2633681Z scale_ub: Optional[float], 2025-05-07T20:32:14.2633777Z contiguous: bool, 2025-05-07T20:32:14.2633873Z compiled: bool, 2025-05-07T20:32:14.2633957Z ) -> None: 2025-05-07T20:32:14.2634067Z torch.manual_seed(2025) 2025-05-07T20:32:14.2634145Z 2025-05-07T20:32:14.2634320Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2636182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2636193Z 2025-05-07T20:32:14.2636320Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2636324Z 2025-05-07T20:32:14.2636441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2636717Z self=, 2025-05-07T20:32:14.2636803Z T=4096, 2025-05-07T20:32:14.2636893Z D=7168, 2025-05-07T20:32:14.2636980Z scale_ub=None, 2025-05-07T20:32:14.2637073Z contiguous=True, 2025-05-07T20:32:14.2637169Z compiled=False, 2025-05-07T20:32:14.2637247Z ) 2025-05-07T20:32:14.2637481Z self = 2025-05-07T20:32:14.2637660Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.2637667Z 2025-05-07T20:32:14.2637768Z @given( 2025-05-07T20:32:14.2637897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2638004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2638127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2644088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2644234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2644321Z ) 2025-05-07T20:32:14.2644586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2644695Z def test_silu_mul_quant( 2025-05-07T20:32:14.2644777Z self, 2025-05-07T20:32:14.2644856Z T: int, 2025-05-07T20:32:14.2644941Z D: int, 2025-05-07T20:32:14.2645125Z scale_ub: Optional[float], 2025-05-07T20:32:14.2645222Z contiguous: bool, 2025-05-07T20:32:14.2645319Z compiled: bool, 2025-05-07T20:32:14.2645402Z ) -> None: 2025-05-07T20:32:14.2645544Z torch.manual_seed(2025) 2025-05-07T20:32:14.2645630Z 2025-05-07T20:32:14.2645807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2647691Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2647738Z 2025-05-07T20:32:14.2647864Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2647869Z 2025-05-07T20:32:14.2647977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2648224Z self=, 2025-05-07T20:32:14.2648309Z T=16384, 2025-05-07T20:32:14.2648398Z D=7168, 2025-05-07T20:32:14.2648486Z scale_ub=None, 2025-05-07T20:32:14.2648585Z contiguous=True, 2025-05-07T20:32:14.2648684Z compiled=False, 2025-05-07T20:32:14.2648764Z ) 2025-05-07T20:32:14.2648991Z self = 2025-05-07T20:32:14.2649186Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:14.2649191Z 2025-05-07T20:32:14.2649273Z @given( 2025-05-07T20:32:14.2649398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2649514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2649636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2649771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2649889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2649968Z ) 2025-05-07T20:32:14.2650234Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2650334Z def test_silu_mul_quant( 2025-05-07T20:32:14.2650412Z self, 2025-05-07T20:32:14.2650500Z T: int, 2025-05-07T20:32:14.2650578Z D: int, 2025-05-07T20:32:14.2650681Z scale_ub: Optional[float], 2025-05-07T20:32:14.2650783Z contiguous: bool, 2025-05-07T20:32:14.2650871Z compiled: bool, 2025-05-07T20:32:14.2650997Z ) -> None: 2025-05-07T20:32:14.2651107Z torch.manual_seed(2025) 2025-05-07T20:32:14.2651183Z 2025-05-07T20:32:14.2651364Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2653233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:14.2653243Z 2025-05-07T20:32:14.2653379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:14.2653384Z 2025-05-07T20:32:14.2653494Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2653726Z self=, 2025-05-07T20:32:14.2653813Z T=16384, 2025-05-07T20:32:14.2653893Z D=7168, 2025-05-07T20:32:14.2653979Z scale_ub=1200.0, 2025-05-07T20:32:14.2654070Z contiguous=True, 2025-05-07T20:32:14.2654204Z compiled=False, 2025-05-07T20:32:14.2654281Z ) 2025-05-07T20:32:14.2654510Z self = 2025-05-07T20:32:14.2654692Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:14.2654736Z 2025-05-07T20:32:14.2654821Z @given( 2025-05-07T20:32:14.2654942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2655042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2655166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2655288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2655408Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2655490Z ) 2025-05-07T20:32:14.2656152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2656457Z def test_silu_mul_quant( 2025-05-07T20:32:14.2656535Z self, 2025-05-07T20:32:14.2656616Z T: int, 2025-05-07T20:32:14.2656702Z D: int, 2025-05-07T20:32:14.2656806Z scale_ub: Optional[float], 2025-05-07T20:32:14.2656899Z contiguous: bool, 2025-05-07T20:32:14.2656994Z compiled: bool, 2025-05-07T20:32:14.2657077Z ) -> None: 2025-05-07T20:32:14.2657176Z torch.manual_seed(2025) 2025-05-07T20:32:14.2657262Z 2025-05-07T20:32:14.2657439Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2659379Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
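Every example above dies on its very first allocation while the allocator already holds roughly 22 GiB, so the failures cascade from state left behind by earlier examples rather than from any single oversized tensor. Below is a minimal sketch of a per-example cleanup, assuming a unittest-style test class like the ActivationTests shown in this log; the setUp hook and its placement are illustrative, not taken from the FBGEMM sources:

    import unittest

    import torch

    class ActivationTests(unittest.TestCase):
        def setUp(self) -> None:
            # Hand cached allocator segments back to the driver so each
            # Hypothesis example starts from an empty pool instead of the
            # ~22 GiB left pinned by the previous example.
            if torch.cuda.is_available():
                torch.cuda.synchronize()
                torch.cuda.empty_cache()

The allocator hint printed in the error text itself, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only mitigates fragmentation, and it must be exported before the first CUDA allocation, i.e. in the job environment rather than inside the test body.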
2025-05-07T20:32:14.2659626Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:14.2660954Z     @given(
2025-05-07T20:32:14.2661075Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:14.2661188Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:14.2661306Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:14.2661437Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:14.2661561Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:14.2661637Z     )
2025-05-07T20:32:14.2661897Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:14.2661995Z     def test_silu_mul_quant(
2025-05-07T20:32:14.2662074Z         self,
2025-05-07T20:32:14.2662161Z         T: int,
2025-05-07T20:32:14.2662240Z         D: int,
2025-05-07T20:32:14.2662345Z         scale_ub: Optional[float],
2025-05-07T20:32:14.2662465Z         contiguous: bool,
2025-05-07T20:32:14.2662562Z         compiled: bool,
2025-05-07T20:32:14.2662659Z     ) -> None:
2025-05-07T20:32:14.2662771Z         torch.manual_seed(2025)
2025-05-07T20:32:14.2663114Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2663291Z         x_sign = torch.sign(x)
2025-05-07T20:32:14.2663422Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2663513Z         x = x_sign * x_clamp
2025-05-07T20:32:14.2663666Z         x0 = x[:, :D]
2025-05-07T20:32:14.2663747Z         x1 = x[:, D:]
2025-05-07T20:32:14.2663915Z         if contiguous:
2025-05-07T20:32:14.2664012Z             x0 = x0.contiguous()
2025-05-07T20:32:14.2664106Z             x1 = x1.contiguous()
2025-05-07T20:32:14.2664277Z         if scale_ub is not None:
2025-05-07T20:32:14.2664394Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:14.2664533Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:14.2664608Z             )
2025-05-07T20:32:14.2664732Z         else:
2025-05-07T20:32:14.2664828Z             scale_ub_tensor = None
2025-05-07T20:32:14.2665046Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:14.2665139Z             op = silu_mul_quant
2025-05-07T20:32:14.2665227Z             if compiled:
2025-05-07T20:32:14.2665336Z                 op = torch.compile(op)
2025-05-07T20:32:14.2665446Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:14.2665620Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:14.2665726Z moe/activation_test.py:117:
2025-05-07T20:32:14.2665863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.2665968Z moe/activation_test.py:115: in fn
2025-05-07T20:32:14.2666074Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:14.2666598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:14.2666701Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:14.2667075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:14.2667314Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:14.2667669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:14.2667774Z     kernel = self.compile(
2025-05-07T20:32:14.2668168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:14.2668351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:14.2668533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:14.2669569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:14.2671267Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:14.2671552Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:14.2671668Z                            module_map=module_map)
2025-05-07T20:32:14.2671902Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.2672016Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.2672098Z E   ^
2025-05-07T20:32:14.2672494Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2673061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
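This first non-OOM failure is the real signal of the run: Triton refuses to lower fp8e4nv (FP8 E4M3) because the kernel is compiled for a GPU older than SM 8.9, and the A10G on a linux.g5.4xlarge reports capability (8, 6), which is why only 'fp8e4b15' and 'fp8e5' are offered. A hedged sketch of a capability guard follows; the helper name is ours, while the 8.9 cutoff is Triton's requirement for native E4M3:

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (E4M3) natively only on SM 8.9+ (Ada, Hopper);
        # pre-Ada parts such as the A10G (SM 8.6) raise the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Used as a skip condition (for example, self.skipTest(...) at the top of test_silu_mul_quant), this would turn the hard CompilationError into an explicit skip on pre-SM-8.9 runners.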
2025-05-07T20:32:14.2673174Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:14.2676458Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2678356Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2678490Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2678607Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2684460Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:14.2684565Z moe/activation_test.py:117:
2025-05-07T20:32:14.2685349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:14.2685450Z     return fn(*args, **kwargs)
2025-05-07T20:32:14.2685960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:14.2686076Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:14.2691277Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:14.2691391Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:14.2691474Z E   ^
2025-05-07T20:32:14.2691839Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.2692278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.2692394Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:14.2696036Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2697931Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:14.2698212Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2698334Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:14.2701940Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:14.2703837Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free.
2025-05-07T20:32:14.2703973Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:14.2704089Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:14.2707405Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:14.2709242Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free.
2025-05-07T20:32:14.2709424Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:14.2709560Z =============================== warnings summary ===============================
2025-05-07T20:32:14.2709880Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2710206Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2710517Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:14.2711432Z   /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:14.2711671Z     warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:14.2711891Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:14.2712068Z ================= 1 failed, 1 deselected, 3 warnings in 21.66s =================
2025-05-07T20:32:15.8709447Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:15.9326994Z [EXEC] [ATTEMPT 0/2] Command attempt failed.
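One more detail worth reading out of the report above: the free-memory figure shrinks across the run, from 26.44 MiB at the first OOM to 4.44 MiB by the final examples, while PyTorch's allocated total creeps from 21.73 GiB to 21.77 GiB, so each failed example leaks a little more device memory into the next. A sketch of a drift check that would surface this immediately; the helper and the 64 MiB slack are illustrative, not from the test suite:

    import torch

    def assert_no_cuda_growth(baseline_bytes: int, slack_mb: int = 64) -> None:
        # Compare allocator-visible memory against a baseline captured before
        # the example ran; fail fast instead of waiting for a cascade of OOMs.
        grown = torch.cuda.memory_allocated() - baseline_bytes
        assert grown < slack_mb * 1024 * 1024, (
            f"CUDA memory grew by {grown / 2**20:.1f} MiB across one example"
        )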
See " 2025-05-07T20:32:14.2711676Z 2025-05-07T20:32:14.2711891Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:14.2712068Z ================= 1 failed, 1 deselected, 3 warnings in 21.66s ================= 2025-05-07T20:32:15.8709447Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:15.9326994Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:15.9327273Z 2025-05-07T20:32:17.9344991Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:20.0772447Z ============================= test session starts ============================== 2025-05-07T20:32:20.0773106Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:20.0773639Z cachedir: .pytest_cache 2025-05-07T20:32:20.0774225Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:20.0774982Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:20.0775400Z plugins: hypothesis-6.131.14 2025-05-07T20:32:21.6762057Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:21.8538795Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:21.8539231Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:21.8539469Z 2025-05-07T20:32:24.3750966Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.3752346Z self=, 2025-05-07T20:32:24.3752828Z T=1, 2025-05-07T20:32:24.3753032Z D=5120, 2025-05-07T20:32:24.3753231Z scale_ub=None, 2025-05-07T20:32:24.3753460Z contiguous=True, 2025-05-07T20:32:24.3753694Z compiled=True, 2025-05-07T20:32:24.3753907Z ) 2025-05-07T20:32:24.3754251Z self = 2025-05-07T20:32:24.3754753Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.3755025Z 2025-05-07T20:32:24.3755232Z @given( 2025-05-07T20:32:24.3755483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.3756035Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.3756349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.3756694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.3757041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.3757340Z ) 2025-05-07T20:32:24.3757698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.3758153Z def test_silu_mul_quant( 2025-05-07T20:32:24.3758407Z self, 2025-05-07T20:32:24.3758607Z T: int, 2025-05-07T20:32:24.3758816Z D: int, 2025-05-07T20:32:24.3759047Z scale_ub: Optional[float], 2025-05-07T20:32:24.3759326Z contiguous: bool, 2025-05-07T20:32:24.3759579Z compiled: bool, 2025-05-07T20:32:24.3759818Z ) -> None: 2025-05-07T20:32:24.3760039Z torch.manual_seed(2025) 2025-05-07T20:32:24.3760298Z 2025-05-07T20:32:24.3760586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.3760936Z 2025-05-07T20:32:24.3761137Z x_sign = torch.sign(x) 2025-05-07T20:32:24.3761438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:24.3761752Z x = x_sign * x_clamp 2025-05-07T20:32:24.3762007Z x0 = x[:, :D] 2025-05-07T20:32:24.3762234Z x1 = x[:, D:] 2025-05-07T20:32:24.3762449Z 2025-05-07T20:32:24.3762639Z if contiguous: 2025-05-07T20:32:24.3762882Z x0 = x0.contiguous() 2025-05-07T20:32:24.3763153Z x1 = x1.contiguous() 2025-05-07T20:32:24.3763395Z 2025-05-07T20:32:24.3763595Z if scale_ub is not None: 2025-05-07T20:32:24.3763976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.3764319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.3764635Z ) 2025-05-07T20:32:24.3764840Z else: 2025-05-07T20:32:24.3765058Z scale_ub_tensor = None 2025-05-07T20:32:24.3765318Z 2025-05-07T20:32:24.3765565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.3765883Z op = silu_mul_quant 2025-05-07T20:32:24.3766145Z if compiled: 2025-05-07T20:32:24.3766404Z op = torch.compile(op) 2025-05-07T20:32:24.3766711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.3766995Z 2025-05-07T20:32:24.3767202Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.3767493Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.3767791Z 2025-05-07T20:32:24.3768041Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.3768389Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.3768691Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.3769018Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.3769389Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.3769705Z 2025-05-07T20:32:24.3770003Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.3770205Z 2025-05-07T20:32:24.3770317Z moe/activation_test.py:126: 2025-05-07T20:32:24.3770618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.3771025Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.3771370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.3772183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.3772953Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.3773523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.3774226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.3775000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.3775740Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.3776511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:24.3777281Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.3778131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.3778792Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.3779413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.3779944Z fn() 2025-05-07T20:32:24.3780460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.3781061Z self.fn.run( 
2025-05-07T20:32:24.3781544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.3782087Z kernel = self.compile( 2025-05-07T20:32:24.3782643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.3783322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.3783727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.3783955Z 2025-05-07T20:32:24.3784169Z self = 2025-05-07T20:32:24.3785330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.3786766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89ab60af0>} 2025-05-07T20:32:24.3788139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.3789187Z context = 2025-05-07T20:32:24.3789485Z 2025-05-07T20:32:24.3789659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.3790197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.3790682Z module_map=module_map) 2025-05-07T20:32:24.3791052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.3791427Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.3791706Z E ^ 2025-05-07T20:32:24.3792229Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.3792690Z 2025-05-07T20:32:24.3793116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.3793687Z 2025-05-07T20:32:24.3793798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.3794225Z self=, 2025-05-07T20:32:24.3794637Z T=2048, 2025-05-07T20:32:24.3794830Z D=5120, 2025-05-07T20:32:24.3795032Z scale_ub=1200.0, 2025-05-07T20:32:24.3795268Z contiguous=True, 2025-05-07T20:32:24.3795498Z compiled=False, 2025-05-07T20:32:24.3795722Z ) 2025-05-07T20:32:25.7364543Z self = 2025-05-07T20:32:25.7366215Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.7366523Z 2025-05-07T20:32:25.7366610Z @given( 2025-05-07T20:32:25.7366853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.7367174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.7367507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.7367849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.7375904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.7376244Z ) 2025-05-07T20:32:25.7376614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.7377070Z def test_silu_mul_quant( 2025-05-07T20:32:25.7377316Z self, 2025-05-07T20:32:25.7377522Z T: int, 2025-05-07T20:32:25.7377733Z D: int, 2025-05-07T20:32:25.7377953Z scale_ub: Optional[float], 2025-05-07T20:32:25.7378313Z contiguous: bool, 2025-05-07T20:32:25.7378569Z compiled: bool, 2025-05-07T20:32:25.7378802Z ) -> None: 2025-05-07T20:32:25.7379029Z torch.manual_seed(2025) 2025-05-07T20:32:25.7379282Z 2025-05-07T20:32:25.7379562Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.7379916Z 
2025-05-07T20:32:25.7380118Z x_sign = torch.sign(x) 2025-05-07T20:32:25.7380414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.7380730Z x = x_sign * x_clamp 2025-05-07T20:32:25.7380976Z x0 = x[:, :D] 2025-05-07T20:32:25.7381191Z x1 = x[:, D:] 2025-05-07T20:32:25.7381407Z 2025-05-07T20:32:25.7381602Z if contiguous: 2025-05-07T20:32:25.7381833Z x0 = x0.contiguous() 2025-05-07T20:32:25.7382098Z x1 = x1.contiguous() 2025-05-07T20:32:25.7382341Z 2025-05-07T20:32:25.7382695Z if scale_ub is not None: 2025-05-07T20:32:25.7382974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.7383318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.7383635Z ) 2025-05-07T20:32:25.7383830Z else: 2025-05-07T20:32:25.7384049Z scale_ub_tensor = None 2025-05-07T20:32:25.7384305Z 2025-05-07T20:32:25.7384543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.7384866Z op = silu_mul_quant 2025-05-07T20:32:25.7385131Z if compiled: 2025-05-07T20:32:25.7385416Z op = torch.compile(op) 2025-05-07T20:32:25.7385742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.7386026Z 2025-05-07T20:32:25.7386222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.7386397Z 2025-05-07T20:32:25.7386501Z moe/activation_test.py:117: 2025-05-07T20:32:25.7386801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.7387140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.7387423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.7388209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.7388923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.7389465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.7390234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.7390914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.7391456Z kernel = self.compile( 2025-05-07T20:32:25.7392008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.7392678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.7393080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.7393355Z 2025-05-07T20:32:25.7393569Z self = 2025-05-07T20:32:25.7394674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.7396158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa89aa39990>} 2025-05-07T20:32:25.7397574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.7398618Z context = 2025-05-07T20:32:25.7398907Z 2025-05-07T20:32:25.7399083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.7399618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.7400089Z module_map=module_map) 2025-05-07T20:32:25.7400460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.7400825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.7401080Z E ^ 2025-05-07T20:32:25.7401553Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.7402009Z 2025-05-07T20:32:25.7402438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.7402955Z 2025-05-07T20:32:25.7403116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.7403539Z self=, 2025-05-07T20:32:25.7403945Z T=2048, 2025-05-07T20:32:25.7404143Z D=5120, 2025-05-07T20:32:25.7404334Z scale_ub=1200.0, 2025-05-07T20:32:25.7404562Z contiguous=True, 2025-05-07T20:32:25.7404791Z compiled=True, 2025-05-07T20:32:25.7404995Z ) 2025-05-07T20:32:25.7405322Z self = 2025-05-07T20:32:25.7405823Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.7406117Z 2025-05-07T20:32:25.7406203Z @given( 2025-05-07T20:32:25.7406464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.7406782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.7407095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.7407424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.7407763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.7408052Z ) 2025-05-07T20:32:25.7408403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.7408855Z def test_silu_mul_quant( 2025-05-07T20:32:25.7409102Z self, 2025-05-07T20:32:25.7409339Z T: int, 2025-05-07T20:32:25.7409540Z D: int, 2025-05-07T20:32:25.7409761Z scale_ub: Optional[float], 2025-05-07T20:32:25.7410033Z contiguous: bool, 2025-05-07T20:32:25.7410276Z compiled: bool, 2025-05-07T20:32:25.7410543Z ) -> None: 2025-05-07T20:32:25.7410756Z torch.manual_seed(2025) 2025-05-07T20:32:25.7410999Z 2025-05-07T20:32:25.7411278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.7411620Z 2025-05-07T20:32:25.7411812Z x_sign = torch.sign(x) 2025-05-07T20:32:25.7412111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.7412422Z x = x_sign * x_clamp 2025-05-07T20:32:25.7412660Z x0 = x[:, :D] 2025-05-07T20:32:25.7412880Z x1 = x[:, D:] 2025-05-07T20:32:25.7413090Z 2025-05-07T20:32:25.7413275Z if contiguous: 2025-05-07T20:32:25.7413555Z x0 = x0.contiguous() 2025-05-07T20:32:25.7413813Z x1 = x1.contiguous() 2025-05-07T20:32:25.7414049Z 2025-05-07T20:32:25.7414245Z if scale_ub is not None: 2025-05-07T20:32:25.7414524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.7414858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.7415172Z ) 2025-05-07T20:32:25.7415367Z else: 2025-05-07T20:32:25.7415573Z scale_ub_tensor = None 2025-05-07T20:32:25.7415829Z 2025-05-07T20:32:25.7416068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.7416385Z op = silu_mul_quant 2025-05-07T20:32:25.7416632Z if compiled: 
2025-05-07T20:32:25.7416886Z                 op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8996d96c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:25.7443855Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:32:26.9387052Z self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
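The failure is environmental rather than shape-dependent: fp8e4nv is Triton's name for the FP8 E4M3 encoding, which its NVIDIA backend only lowers on compute capability 8.9+ (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G (sm_86), hence the message offering only 'fp8e4b15' and 'fp8e5'. A minimal capability probe, as a sketch; the helper name supports_fp8e4nv is hypothetical and not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv") Triton kernels need compute capability >= 8.9.
        # The A10G on g5 instances reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)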
2025-05-07T20:32:26.9417754Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:26.9420075Z self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [test source identical to the listing above]

Here fn() completes and the same error surfaces in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [jit/autotuner/compiler frames identical to the first traceback above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
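So with compiled=True the fused op returns and only the row-wise quantizer trips the unsupported cast; both paths hinge on the same fp8e4nv conversion. As a reading aid, here is a pure-PyTorch sketch of the row-wise contract the test dequantizes against (y ~ y_fp8.to(torch.float32) * y_scale[:, None]); the max-abs/scale_ub formula is an assumption about triton_quantize_fp8_row, not taken from this log:

    import torch

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Assumed contract: y ~ y_fp8.to(torch.float32) * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale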
2025-05-07T20:32:28.5337524Z op = torch.compile(op) 2025-05-07T20:32:28.5337830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5338230Z 2025-05-07T20:32:28.5338437Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5338604Z 2025-05-07T20:32:28.5338714Z moe/activation_test.py:117: 2025-05-07T20:32:28.5339009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5339346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5339638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5340346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5341134Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5341687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5342534Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5343212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5343760Z kernel = self.compile( 2025-05-07T20:32:28.5344317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5344991Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5345389Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5345625Z 2025-05-07T20:32:28.5345838Z self = 2025-05-07T20:32:28.5346946Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5348355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89937d7e0>} 2025-05-07T20:32:28.5349732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5350779Z context = 2025-05-07T20:32:28.5351135Z 2025-05-07T20:32:28.5351305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5351841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5352321Z module_map=module_map) 2025-05-07T20:32:28.5352691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5353184Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5353495Z E ^ 2025-05-07T20:32:28.5354060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5354518Z 2025-05-07T20:32:28.5354945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5355463Z 2025-05-07T20:32:28.5355837Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5356269Z self=, 2025-05-07T20:32:28.5356674Z T=4096, 2025-05-07T20:32:28.5356866Z D=7168, 2025-05-07T20:32:28.5357051Z scale_ub=None, 2025-05-07T20:32:28.5357273Z contiguous=False, 2025-05-07T20:32:28.5357504Z compiled=False, 2025-05-07T20:32:28.5357704Z ) 2025-05-07T20:32:28.5358139Z self = 2025-05-07T20:32:28.5358645Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.5358918Z 2025-05-07T20:32:28.5359055Z @given( 2025-05-07T20:32:28.5359288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.5359603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.5359907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.5360242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.5360577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.5360868Z ) 2025-05-07T20:32:28.5361223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.5361675Z def test_silu_mul_quant( 2025-05-07T20:32:28.5361988Z self, 2025-05-07T20:32:28.5362179Z T: int, 2025-05-07T20:32:28.5362377Z D: int, 2025-05-07T20:32:28.5362604Z scale_ub: Optional[float], 2025-05-07T20:32:28.5362872Z contiguous: bool, 2025-05-07T20:32:28.5363115Z compiled: bool, 2025-05-07T20:32:28.5363343Z ) -> None: 2025-05-07T20:32:28.5363557Z torch.manual_seed(2025) 2025-05-07T20:32:28.5363805Z 2025-05-07T20:32:28.5364086Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.5364425Z 2025-05-07T20:32:28.5364623Z x_sign = torch.sign(x) 2025-05-07T20:32:28.5364918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.5365226Z x = x_sign * x_clamp 2025-05-07T20:32:28.5365467Z x0 = x[:, :D] 2025-05-07T20:32:28.5365683Z x1 = x[:, D:] 2025-05-07T20:32:28.5365895Z 2025-05-07T20:32:28.5366077Z if contiguous: 2025-05-07T20:32:28.5366308Z x0 = x0.contiguous() 2025-05-07T20:32:28.5366569Z x1 = x1.contiguous() 2025-05-07T20:32:28.5366803Z 2025-05-07T20:32:28.5366996Z if scale_ub is not None: 2025-05-07T20:32:28.5367271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.5367605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.5367916Z ) 2025-05-07T20:32:28.5368116Z else: 2025-05-07T20:32:28.5368322Z scale_ub_tensor = None 2025-05-07T20:32:28.5368576Z 2025-05-07T20:32:28.5368815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.5369125Z op = silu_mul_quant 2025-05-07T20:32:28.5369375Z if compiled: 2025-05-07T20:32:28.5369623Z op = torch.compile(op) 2025-05-07T20:32:28.5369916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5370260Z 2025-05-07T20:32:28.5370458Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.5370625Z 2025-05-07T20:32:28.5370729Z moe/activation_test.py:117: 2025-05-07T20:32:28.5371024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5371356Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.5371645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.5372341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.5373043Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.5373587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.5374278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.5374942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.5375488Z kernel = self.compile( 2025-05-07T20:32:28.5376043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.5376705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.5377149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.5377384Z 2025-05-07T20:32:28.5377594Z self = 2025-05-07T20:32:28.5378809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.5380197Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89937dfc0>} 2025-05-07T20:32:28.5381569Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.5382657Z context = 2025-05-07T20:32:28.5382946Z 2025-05-07T20:32:28.5383122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.5383656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.5384131Z module_map=module_map) 2025-05-07T20:32:28.5384501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.5384863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.5385122Z E ^ 2025-05-07T20:32:28.5385595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.5386052Z 2025-05-07T20:32:28.5386506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.5387050Z 2025-05-07T20:32:28.5387163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.5387581Z self=, 2025-05-07T20:32:28.5387986Z T=128, 2025-05-07T20:32:28.5388176Z D=7168, 2025-05-07T20:32:28.5388371Z scale_ub=None, 2025-05-07T20:32:28.5388587Z contiguous=False, 2025-05-07T20:32:28.5388819Z compiled=True, 2025-05-07T20:32:28.5389015Z ) 2025-05-07T20:32:28.6003292Z self = 2025-05-07T20:32:28.6003957Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:28.6004242Z 2025-05-07T20:32:28.6004321Z @given( 2025-05-07T20:32:28.6004563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.6005001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.6005321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.6005659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.6005995Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.6006288Z ) 2025-05-07T20:32:28.6006648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.6007091Z def test_silu_mul_quant( 2025-05-07T20:32:28.6007337Z self, 2025-05-07T20:32:28.6007537Z T: int, 2025-05-07T20:32:28.6007734Z D: int, 2025-05-07T20:32:28.6007960Z scale_ub: Optional[float], 2025-05-07T20:32:28.6008242Z contiguous: bool, 2025-05-07T20:32:28.6008489Z compiled: bool, 2025-05-07T20:32:28.6008715Z ) -> None: 2025-05-07T20:32:28.6008938Z torch.manual_seed(2025) 2025-05-07T20:32:28.6009186Z 2025-05-07T20:32:28.6009463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.6009822Z 2025-05-07T20:32:28.6010024Z x_sign = torch.sign(x) 2025-05-07T20:32:28.6010319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.6010636Z x = x_sign * x_clamp 2025-05-07T20:32:28.6010881Z x0 = x[:, :D] 2025-05-07T20:32:28.6011168Z x1 = x[:, D:] 2025-05-07T20:32:28.6011385Z 2025-05-07T20:32:28.6011579Z if contiguous: 2025-05-07T20:32:28.6011810Z x0 = x0.contiguous() 2025-05-07T20:32:28.6012077Z x1 = x1.contiguous() 2025-05-07T20:32:28.6012407Z 2025-05-07T20:32:28.6012599Z if scale_ub is not None: 2025-05-07T20:32:28.6012882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.6013225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.6013537Z ) 2025-05-07T20:32:28.6013733Z else: 2025-05-07T20:32:28.6013949Z scale_ub_tensor = None 2025-05-07T20:32:28.6014215Z 2025-05-07T20:32:28.6014452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6014773Z op = silu_mul_quant 2025-05-07T20:32:28.6015031Z if compiled: 2025-05-07T20:32:28.6015349Z op = torch.compile(op) 2025-05-07T20:32:28.6015657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.6015936Z 2025-05-07T20:32:28.6016142Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.6016429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.6016724Z 2025-05-07T20:32:28.6016999Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.6017455Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.6017762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.6018168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.6018533Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6018853Z 2025-05-07T20:32:28.6019071Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:28.6019272Z 2025-05-07T20:32:28.6019381Z moe/activation_test.py:126: 2025-05-07T20:32:28.6019682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6020025Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.6020367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.6021170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.6021946Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.6022509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.6023210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.6023911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.6024713Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6025490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:28.6026256Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.6027002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.6027663Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.6028281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.6028807Z fn() 2025-05-07T20:32:28.6029332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.6029926Z self.fn.run( 2025-05-07T20:32:28.6030415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.6030955Z kernel = self.compile( 2025-05-07T20:32:28.6031559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.6032234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.6032635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.6032911Z 2025-05-07T20:32:28.6033123Z self = 2025-05-07T20:32:28.6034229Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.6035638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa89939a560>} 2025-05-07T20:32:28.6037016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.6038105Z context = 2025-05-07T20:32:28.6038405Z 2025-05-07T20:32:28.6038578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.6039117Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.6039604Z module_map=module_map) 2025-05-07T20:32:28.6039978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.6040354Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.6040632Z E ^ 2025-05-07T20:32:28.6041109Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.6041582Z 2025-05-07T20:32:28.6042007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.6042545Z 2025-05-07T20:32:28.6042655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.6043085Z self=, 2025-05-07T20:32:28.6043492Z T=128, 2025-05-07T20:32:28.6043694Z D=7168, 2025-05-07T20:32:28.6043898Z scale_ub=None, 2025-05-07T20:32:28.6044119Z contiguous=False, 2025-05-07T20:32:28.6044357Z compiled=False, 2025-05-07T20:32:28.6044575Z ) 2025-05-07T20:32:28.9611148Z self = 2025-05-07T20:32:28.9612498Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:28.9613058Z 2025-05-07T20:32:28.9613217Z @given( 2025-05-07T20:32:28.9613908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.9614543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.9615172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.9615856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.9616319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.9616641Z ) 2025-05-07T20:32:28.9617003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.9617457Z def test_silu_mul_quant( 2025-05-07T20:32:28.9617706Z self, 2025-05-07T20:32:28.9617910Z T: int, 2025-05-07T20:32:28.9618216Z D: int, 2025-05-07T20:32:28.9618446Z scale_ub: Optional[float], 2025-05-07T20:32:28.9618729Z contiguous: bool, 2025-05-07T20:32:28.9618979Z compiled: bool, 2025-05-07T20:32:28.9619207Z ) -> None: 2025-05-07T20:32:28.9619435Z torch.manual_seed(2025) 2025-05-07T20:32:28.9619691Z 2025-05-07T20:32:28.9619972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.9620328Z 2025-05-07T20:32:28.9620534Z x_sign = torch.sign(x) 2025-05-07T20:32:28.9620833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.9621220Z x = x_sign * x_clamp 2025-05-07T20:32:28.9621471Z x0 = x[:, :D] 2025-05-07T20:32:28.9621692Z x1 = x[:, D:] 2025-05-07T20:32:28.9621907Z 2025-05-07T20:32:28.9622105Z if contiguous: 2025-05-07T20:32:28.9622398Z x0 = x0.contiguous() 2025-05-07T20:32:28.9622667Z x1 = x1.contiguous() 2025-05-07T20:32:28.9622917Z 2025-05-07T20:32:28.9623113Z if scale_ub is not None: 2025-05-07T20:32:28.9623395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.9623742Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.9624054Z ) 2025-05-07T20:32:28.9624254Z else: 2025-05-07T20:32:28.9624475Z scale_ub_tensor = None 2025-05-07T20:32:28.9624735Z 2025-05-07T20:32:28.9624969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.9625367Z op = silu_mul_quant 2025-05-07T20:32:28.9625623Z if compiled: 
2025-05-07T20:32:28.9625875Z op = torch.compile(op) 2025-05-07T20:32:28.9626182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9626504Z 2025-05-07T20:32:28.9626709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.9626888Z 2025-05-07T20:32:28.9626995Z moe/activation_test.py:117: 2025-05-07T20:32:28.9627302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9627640Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.9627933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9628644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.9629360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.9629910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.9630614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.9631293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.9631837Z kernel = self.compile( 2025-05-07T20:32:28.9632401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.9633076Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.9633482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9633710Z 2025-05-07T20:32:28.9633925Z self = 2025-05-07T20:32:28.9635082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.9636503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8993ed7e0>} 2025-05-07T20:32:28.9637883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.9638940Z context = 2025-05-07T20:32:28.9639234Z 2025-05-07T20:32:28.9639403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.9639942Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.9640426Z module_map=module_map) 2025-05-07T20:32:28.9640794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.9641165Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.9641435Z E ^ 2025-05-07T20:32:28.9641960Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.9642423Z 2025-05-07T20:32:28.9642849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.9643423Z 2025-05-07T20:32:28.9643532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.9643957Z self=, 2025-05-07T20:32:28.9644367Z T=4096, 2025-05-07T20:32:28.9644559Z D=5120, 2025-05-07T20:32:28.9644757Z scale_ub=1200.0, 2025-05-07T20:32:28.9644989Z contiguous=True, 2025-05-07T20:32:28.9645218Z compiled=False, 2025-05-07T20:32:28.9645431Z ) 2025-05-07T20:32:28.9645765Z self = 2025-05-07T20:32:28.9646344Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.9646652Z 2025-05-07T20:32:28.9646731Z @given( 2025-05-07T20:32:28.9646967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.9647282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.9647603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.9647940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.9648278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.9648566Z ) 2025-05-07T20:32:28.9648924Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.9649374Z def test_silu_mul_quant( 2025-05-07T20:32:28.9649616Z self, 2025-05-07T20:32:28.9649821Z T: int, 2025-05-07T20:32:28.9650028Z D: int, 2025-05-07T20:32:28.9650250Z scale_ub: Optional[float], 2025-05-07T20:32:28.9650529Z contiguous: bool, 2025-05-07T20:32:28.9650780Z compiled: bool, 2025-05-07T20:32:28.9651004Z ) -> None: 2025-05-07T20:32:28.9651230Z torch.manual_seed(2025) 2025-05-07T20:32:28.9651479Z 2025-05-07T20:32:28.9651755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.9652104Z 2025-05-07T20:32:28.9652307Z x_sign = torch.sign(x) 2025-05-07T20:32:28.9652603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.9652917Z x = x_sign * x_clamp 2025-05-07T20:32:28.9653165Z x0 = x[:, :D] 2025-05-07T20:32:28.9653386Z x1 = x[:, D:] 2025-05-07T20:32:28.9653595Z 2025-05-07T20:32:28.9653789Z if contiguous: 2025-05-07T20:32:28.9654024Z x0 = x0.contiguous() 2025-05-07T20:32:28.9654281Z x1 = x1.contiguous() 2025-05-07T20:32:28.9654576Z 2025-05-07T20:32:28.9654778Z if scale_ub is not None: 2025-05-07T20:32:28.9655060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.9655408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.9655899Z ) 2025-05-07T20:32:28.9656097Z else: 2025-05-07T20:32:28.9656319Z scale_ub_tensor = None 2025-05-07T20:32:28.9656579Z 2025-05-07T20:32:28.9656817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.9663126Z op = silu_mul_quant 2025-05-07T20:32:28.9663421Z if compiled: 2025-05-07T20:32:28.9663684Z op = torch.compile(op) 2025-05-07T20:32:28.9663981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9664256Z 2025-05-07T20:32:28.9664448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.9664614Z 2025-05-07T20:32:28.9664714Z moe/activation_test.py:117: 2025-05-07T20:32:28.9665010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9665337Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.9665629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.9666430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.9667134Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.9667678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.9668421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.9669100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.9669644Z kernel = self.compile( 2025-05-07T20:32:28.9670189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.9670852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.9671251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.9671542Z 2025-05-07T20:32:28.9671758Z self = 2025-05-07T20:32:28.9672849Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.9674242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8993edf30>} 2025-05-07T20:32:28.9675609Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.9676655Z context = 2025-05-07T20:32:28.9676945Z 2025-05-07T20:32:28.9677121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.9677648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.9678125Z module_map=module_map) 2025-05-07T20:32:28.9678498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.9678868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.9679129Z E ^ 2025-05-07T20:32:28.9679604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.9680058Z 2025-05-07T20:32:28.9680484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.9681000Z 2025-05-07T20:32:28.9681171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.9681595Z self=, 2025-05-07T20:32:28.9682005Z T=1, 2025-05-07T20:32:28.9682193Z D=5120, 2025-05-07T20:32:28.9682384Z scale_ub=None, 2025-05-07T20:32:28.9682607Z contiguous=True, 2025-05-07T20:32:28.9682837Z compiled=True, 2025-05-07T20:32:28.9683039Z ) 2025-05-07T20:32:29.5451938Z self = 2025-05-07T20:32:29.5453247Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5453789Z 2025-05-07T20:32:29.5453950Z @given( 2025-05-07T20:32:29.5454424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5455048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5455963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5456441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5456779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5457073Z ) 2025-05-07T20:32:29.5457437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5457891Z def test_silu_mul_quant( 2025-05-07T20:32:29.5458218Z self, 2025-05-07T20:32:29.5458537Z T: int, 2025-05-07T20:32:29.5458743Z D: int, 2025-05-07T20:32:29.5458963Z scale_ub: Optional[float], 2025-05-07T20:32:29.5459244Z contiguous: bool, 2025-05-07T20:32:29.5459549Z compiled: bool, 2025-05-07T20:32:29.5459775Z ) -> None: 2025-05-07T20:32:29.5459997Z torch.manual_seed(2025) 2025-05-07T20:32:29.5460246Z 2025-05-07T20:32:29.5460523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5460873Z 2025-05-07T20:32:29.5461073Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5461366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5461686Z x = x_sign * x_clamp 2025-05-07T20:32:29.5461931Z x0 = x[:, :D] 2025-05-07T20:32:29.5462146Z x1 = x[:, D:] 2025-05-07T20:32:29.5462358Z 2025-05-07T20:32:29.5462622Z if contiguous: 2025-05-07T20:32:29.5462855Z x0 = x0.contiguous() 2025-05-07T20:32:29.5463123Z x1 = x1.contiguous() 2025-05-07T20:32:29.5463370Z 2025-05-07T20:32:29.5463573Z if scale_ub is not None: 2025-05-07T20:32:29.5463850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5464195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5464512Z ) 2025-05-07T20:32:29.5464705Z else: 2025-05-07T20:32:29.5464920Z scale_ub_tensor = None 2025-05-07T20:32:29.5465180Z 2025-05-07T20:32:29.5465417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5465740Z op = silu_mul_quant 2025-05-07T20:32:29.5465995Z if compiled: 2025-05-07T20:32:29.5466245Z op = torch.compile(op) 2025-05-07T20:32:29.5466553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5466834Z 2025-05-07T20:32:29.5467033Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.5467330Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.5467627Z 2025-05-07T20:32:29.5467876Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5468213Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.5468512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.5468836Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.5469198Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.5469515Z 2025-05-07T20:32:29.5469724Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.5469926Z 2025-05-07T20:32:29.5470030Z moe/activation_test.py:126: 2025-05-07T20:32:29.5470329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5470737Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.5471076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.5471885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.5472657Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.5473218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5473912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5474615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.5475357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.5476123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.5476931Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.5477725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.5478379Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.5478991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.5479554Z fn() 2025-05-07T20:32:29.5480074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.5480667Z self.fn.run( 2025-05-07T20:32:29.5481143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5481689Z kernel = self.compile( 2025-05-07T20:32:29.5482246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5482912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5483353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5483590Z 2025-05-07T20:32:29.5483805Z self = 2025-05-07T20:32:29.5484912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5486324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa8993ef0a0>} 2025-05-07T20:32:29.5487693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5488739Z context = 2025-05-07T20:32:29.5489038Z 2025-05-07T20:32:29.5489212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5489747Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5490220Z module_map=module_map) 2025-05-07T20:32:29.5490597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5490964Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.5491242Z E ^ 2025-05-07T20:32:29.5491712Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5492175Z 2025-05-07T20:32:29.5492642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5493163Z 2025-05-07T20:32:29.5493275Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5493696Z self=, 2025-05-07T20:32:29.5494104Z T=2048, 2025-05-07T20:32:29.5494300Z D=5120, 2025-05-07T20:32:29.5494496Z scale_ub=None, 2025-05-07T20:32:29.5494711Z contiguous=True, 2025-05-07T20:32:29.5494937Z compiled=True, 2025-05-07T20:32:29.5495142Z ) 2025-05-07T20:32:30.0850565Z self = 2025-05-07T20:32:30.0851245Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.0851524Z 2025-05-07T20:32:30.0851612Z @given( 2025-05-07T20:32:30.0851854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0852176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0852502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0852845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0853182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0853478Z ) 2025-05-07T20:32:30.0853960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0854420Z def test_silu_mul_quant( 2025-05-07T20:32:30.0854671Z self, 2025-05-07T20:32:30.0854875Z T: int, 2025-05-07T20:32:30.0855072Z D: int, 2025-05-07T20:32:30.0855371Z scale_ub: Optional[float], 2025-05-07T20:32:30.0855997Z contiguous: bool, 2025-05-07T20:32:30.0856244Z compiled: bool, 2025-05-07T20:32:30.0856479Z ) -> None: 2025-05-07T20:32:30.0856705Z torch.manual_seed(2025) 2025-05-07T20:32:30.0856958Z 2025-05-07T20:32:30.0857238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0857592Z 2025-05-07T20:32:30.0857795Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0858215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0858534Z x = x_sign * x_clamp 2025-05-07T20:32:30.0858783Z x0 = x[:, :D] 2025-05-07T20:32:30.0859132Z x1 = x[:, D:] 2025-05-07T20:32:30.0859349Z 2025-05-07T20:32:30.0859557Z if contiguous: 2025-05-07T20:32:30.0859793Z x0 = x0.contiguous() 2025-05-07T20:32:30.0860067Z x1 = x1.contiguous() 2025-05-07T20:32:30.0860317Z 2025-05-07T20:32:30.0860513Z if scale_ub is not None: 2025-05-07T20:32:30.0860800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.0861153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.0861464Z ) 2025-05-07T20:32:30.0861661Z else: 2025-05-07T20:32:30.0861876Z scale_ub_tensor = None 2025-05-07T20:32:30.0862133Z 2025-05-07T20:32:30.0862383Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0862719Z op = silu_mul_quant 2025-05-07T20:32:30.0862973Z if compiled: 
2025-05-07T20:32:30.0863230Z op = torch.compile(op) 2025-05-07T20:32:30.0863542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0863820Z 2025-05-07T20:32:30.0864020Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.0864319Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.0864614Z 2025-05-07T20:32:30.0864862Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0865214Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.0865513Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.0865840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.0866215Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.0866574Z 2025-05-07T20:32:30.0866801Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:30.0867012Z 2025-05-07T20:32:30.0867117Z moe/activation_test.py:126: 2025-05-07T20:32:30.0867504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0867847Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.0868193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.0869016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.0869791Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.0870357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.0871065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.0871780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.0872527Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.0873308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:30.0874082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.0874906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.0875569Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.0876253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.0876839Z fn() 2025-05-07T20:32:30.0877367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.0877973Z self.fn.run( 2025-05-07T20:32:30.0878459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.0879012Z kernel = self.compile( 2025-05-07T20:32:30.0879569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.0880299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0880713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0880948Z 2025-05-07T20:32:30.0881166Z self = 2025-05-07T20:32:30.0882282Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:30.0883707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898eabbe0>} 2025-05-07T20:32:30.0885103Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.0886168Z context = 2025-05-07T20:32:30.0886466Z 2025-05-07T20:32:30.0886641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.0887228Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0887718Z module_map=module_map) 2025-05-07T20:32:30.0888096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0888465Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.0888747Z E ^ 2025-05-07T20:32:30.0889228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0889741Z 2025-05-07T20:32:30.0890171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.0890706Z 2025-05-07T20:32:30.0890817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.0891248Z self=, 2025-05-07T20:32:30.0891663Z T=128, 2025-05-07T20:32:30.0891856Z D=5120, 2025-05-07T20:32:30.0892061Z scale_ub=None, 2025-05-07T20:32:30.0892290Z contiguous=True, 2025-05-07T20:32:30.0892524Z compiled=True, 2025-05-07T20:32:30.0892744Z ) 2025-05-07T20:32:30.9759418Z self = 2025-05-07T20:32:30.9760104Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:30.9760400Z 2025-05-07T20:32:30.9760485Z @given( 2025-05-07T20:32:30.9760738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9761068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9761387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9761735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9762082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9762502Z ) 2025-05-07T20:32:30.9762874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9763333Z def test_silu_mul_quant( 2025-05-07T20:32:30.9763584Z self, 2025-05-07T20:32:30.9763852Z T: int, 2025-05-07T20:32:30.9764057Z D: int, 2025-05-07T20:32:30.9764283Z scale_ub: Optional[float], 2025-05-07T20:32:30.9764568Z contiguous: bool, 2025-05-07T20:32:30.9764816Z compiled: bool, 2025-05-07T20:32:30.9765048Z ) -> None: 2025-05-07T20:32:30.9765273Z torch.manual_seed(2025) 2025-05-07T20:32:30.9765529Z 2025-05-07T20:32:30.9765817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9766176Z 2025-05-07T20:32:30.9766386Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9766802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9767252Z x = x_sign * x_clamp 2025-05-07T20:32:30.9767504Z x0 = x[:, :D] 2025-05-07T20:32:30.9767730Z x1 = x[:, D:] 2025-05-07T20:32:30.9767946Z 2025-05-07T20:32:30.9768143Z if contiguous: 2025-05-07T20:32:30.9768381Z x0 = x0.contiguous() 2025-05-07T20:32:30.9768652Z x1 = x1.contiguous() 2025-05-07T20:32:30.9768918Z 2025-05-07T20:32:30.9769115Z if scale_ub is not None: 2025-05-07T20:32:30.9769404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9769754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9770070Z ) 2025-05-07T20:32:30.9770266Z else: 2025-05-07T20:32:30.9770485Z scale_ub_tensor = None 2025-05-07T20:32:30.9770751Z 2025-05-07T20:32:30.9770995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:30.9800063Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True, )
[elided: test source and the identical _kernel_quantize_fp8_row CompilationError traceback]
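For context on the reference path that keeps failing here: triton_quantize_fp8_row derives one scale per row from that row's max absolute value and casts the row to FP8. A rough pure-PyTorch emulation, assuming E4M3 output and scale_ub acting as a cap on the per-row max (the actual FBGEMM kernel may differ in eps handling and scale_ub semantics):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the row max before the scale is derived.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale

This matches how the test dequantizes: y_fp8.to(torch.float32) * y_scale[:, None].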
2025-05-07T20:32:31.7387437Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:32:31.7775372Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:31.7776899Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:31.7778447Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:31.7779451Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:31.7780674Z W0507 20:32:31.775000 87841 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
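The recompile_limit warning above is a side effect rather than the failure itself: with compiled=True, each new combination of shape and stride (contiguous copies vs. strided slices of x) fails a guard such as the 'x0' stride mismatch reported, and after eight recompiles dynamo stops compiling silu_mul_quant and falls back to eager. A sketch of raising the budget and pre-declaring the varying dimension (recompile_limit is the config name from the warning; mark_dynamic is a standard torch._dynamo helper; x0 here is a stand-in for the test's activation slice):

    import torch

    # Raise the per-function recompile budget (the warning shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Or avoid one recompile per new T by marking the token dim dynamic.
    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x0, 0)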
2025-05-07T20:32:31.8811379Z self = 2025-05-07T20:32:31.8812146Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:31.8812543Z 2025-05-07T20:32:31.8812652Z @given( 2025-05-07T20:32:31.8812899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8813368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8813692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8814038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8814374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8814681Z ) 2025-05-07T20:32:31.8815049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8815497Z def test_silu_mul_quant( 2025-05-07T20:32:31.8815751Z self, 2025-05-07T20:32:31.8815957Z T: int, 2025-05-07T20:32:31.8816161Z D: int, 2025-05-07T20:32:31.8816391Z scale_ub: Optional[float], 2025-05-07T20:32:31.8816677Z contiguous: bool, 2025-05-07T20:32:31.8816949Z compiled: bool, 2025-05-07T20:32:31.8817212Z ) -> None: 2025-05-07T20:32:31.8817442Z torch.manual_seed(2025) 2025-05-07T20:32:31.8817690Z 2025-05-07T20:32:31.8818077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8818432Z 2025-05-07T20:32:31.8818638Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8818939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8819258Z x = x_sign * x_clamp 2025-05-07T20:32:31.8819510Z x0 = x[:, :D] 2025-05-07T20:32:31.8819735Z x1 = x[:, D:] 2025-05-07T20:32:31.8819951Z 2025-05-07T20:32:31.8820146Z if contiguous: 2025-05-07T20:32:31.8820385Z x0 = x0.contiguous() 2025-05-07T20:32:31.8820654Z x1 = x1.contiguous() 2025-05-07T20:32:31.8820903Z 2025-05-07T20:32:31.8821101Z if scale_ub is not None: 2025-05-07T20:32:31.8821387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8821818Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8822136Z ) 2025-05-07T20:32:31.8822337Z else: 2025-05-07T20:32:31.8822561Z scale_ub_tensor = None 2025-05-07T20:32:31.8822820Z 2025-05-07T20:32:31.8823063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8823391Z op = silu_mul_quant 2025-05-07T20:32:31.8823649Z if compiled: 2025-05-07T20:32:31.8823901Z op = torch.compile(op) 2025-05-07T20:32:31.8824212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8824498Z 2025-05-07T20:32:31.8824697Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.8824992Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.8825288Z 2025-05-07T20:32:31.8825534Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8825882Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.8826188Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.8826512Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.8826881Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8827199Z 2025-05-07T20:32:31.8827404Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.8827608Z 2025-05-07T20:32:31.8827796Z moe/activation_test.py:126: 2025-05-07T20:32:31.8828100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8828442Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.8828834Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8829636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.8830407Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.8830962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8831660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8832362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.8833145Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8833911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:31.8834675Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8835417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.8836073Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.8836682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.8837218Z fn() 2025-05-07T20:32:31.8837742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.8838338Z self.fn.run( 2025-05-07T20:32:31.8838818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8839365Z kernel = self.compile( 2025-05-07T20:32:31.8839921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8840587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8840992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8841224Z 2025-05-07T20:32:31.8841444Z self = 2025-05-07T20:32:31.8842593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8844000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806df30>} 2025-05-07T20:32:31.8845369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8846420Z context = 2025-05-07T20:32:31.8846719Z 2025-05-07T20:32:31.8846895Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8847424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8847909Z module_map=module_map) 2025-05-07T20:32:31.8848293Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8848664Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.8848938Z E ^ 2025-05-07T20:32:31.8849458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8849920Z 2025-05-07T20:32:31.8850350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8850938Z 2025-05-07T20:32:31.8851051Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8851472Z self=, 2025-05-07T20:32:31.8851883Z T=1, 2025-05-07T20:32:31.8852075Z D=5120, 2025-05-07T20:32:31.8852270Z scale_ub=1200.0, 2025-05-07T20:32:31.8852503Z contiguous=True, 2025-05-07T20:32:31.8852737Z compiled=True, 2025-05-07T20:32:31.8852943Z ) 2025-05-07T20:32:32.0296713Z self = 2025-05-07T20:32:32.0297563Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.0298159Z 2025-05-07T20:32:32.0298261Z @given( 2025-05-07T20:32:32.0298515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0298843Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0299155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0299497Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0299842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0300128Z ) 2025-05-07T20:32:32.0300490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0300939Z def test_silu_mul_quant( 2025-05-07T20:32:32.0301179Z self, 2025-05-07T20:32:32.0301382Z T: int, 2025-05-07T20:32:32.0301584Z D: int, 2025-05-07T20:32:32.0301806Z scale_ub: Optional[float], 2025-05-07T20:32:32.0302091Z contiguous: bool, 2025-05-07T20:32:32.0302339Z compiled: bool, 2025-05-07T20:32:32.0302574Z ) -> None: 2025-05-07T20:32:32.0302797Z torch.manual_seed(2025) 2025-05-07T20:32:32.0303050Z 2025-05-07T20:32:32.0303333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0303680Z 2025-05-07T20:32:32.0303881Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0304180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0304492Z x = x_sign * x_clamp 2025-05-07T20:32:32.0304739Z x0 = x[:, :D] 2025-05-07T20:32:32.0304962Z x1 = x[:, D:] 2025-05-07T20:32:32.0305169Z 2025-05-07T20:32:32.0305362Z if contiguous: 2025-05-07T20:32:32.0305600Z x0 = x0.contiguous() 2025-05-07T20:32:32.0305862Z x1 = x1.contiguous() 2025-05-07T20:32:32.0306109Z 2025-05-07T20:32:32.0306313Z if scale_ub is not None: 2025-05-07T20:32:32.0306672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0307022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0307338Z ) 2025-05-07T20:32:32.0307530Z else: 2025-05-07T20:32:32.0307745Z scale_ub_tensor = None 2025-05-07T20:32:32.0308009Z 2025-05-07T20:32:32.0308252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0308570Z op = silu_mul_quant 2025-05-07T20:32:32.0308826Z if compiled: 2025-05-07T20:32:32.0309085Z op = torch.compile(op) 2025-05-07T20:32:32.0309386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0309668Z 2025-05-07T20:32:32.0309865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0310033Z 2025-05-07T20:32:32.0310137Z moe/activation_test.py:117: 2025-05-07T20:32:32.0310441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0310776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0311065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0311642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.0312217Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.0312964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0313666Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.0314276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.0314975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.0315654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.0316194Z kernel = self.compile( 2025-05-07T20:32:32.0316753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.0317473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.0317915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0318151Z 2025-05-07T20:32:32.0318363Z self = 2025-05-07T20:32:32.0319471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.0320881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89829b1c0>} 2025-05-07T20:32:32.0322256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.0323300Z context = 2025-05-07T20:32:32.0323606Z 2025-05-07T20:32:32.0323781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.0324320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.0324801Z module_map=module_map) 2025-05-07T20:32:32.0325171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.0325535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.0325801Z E ^ 2025-05-07T20:32:32.0326275Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.0326739Z 2025-05-07T20:32:32.0327231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.0327785Z 2025-05-07T20:32:32.0327893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0328317Z self=, 2025-05-07T20:32:32.0328721Z T=1, 2025-05-07T20:32:32.0328916Z D=5120, 2025-05-07T20:32:32.0329114Z scale_ub=None, 2025-05-07T20:32:32.0329334Z contiguous=False, 2025-05-07T20:32:32.0329566Z compiled=True, 2025-05-07T20:32:32.0329778Z ) 2025-05-07T20:32:32.1003185Z self = 2025-05-07T20:32:32.1003975Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.1004372Z 2025-05-07T20:32:32.1004490Z @given( 2025-05-07T20:32:32.1004749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.1005071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.1005384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.1005733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.1006075Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.1006369Z ) 2025-05-07T20:32:32.1006732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.1007297Z def test_silu_mul_quant( 2025-05-07T20:32:32.1007549Z self, 2025-05-07T20:32:32.1007751Z T: int, 2025-05-07T20:32:32.1007956Z D: int, 2025-05-07T20:32:32.1008178Z scale_ub: Optional[float], 2025-05-07T20:32:32.1008519Z contiguous: bool, 2025-05-07T20:32:32.1008765Z compiled: bool, 2025-05-07T20:32:32.1008989Z ) -> None: 2025-05-07T20:32:32.1009216Z torch.manual_seed(2025) 2025-05-07T20:32:32.1009468Z 2025-05-07T20:32:32.1009751Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.1010103Z 2025-05-07T20:32:32.1010308Z x_sign = torch.sign(x) 2025-05-07T20:32:32.1010612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.1010927Z x = x_sign * x_clamp 2025-05-07T20:32:32.1011173Z x0 = x[:, :D] 2025-05-07T20:32:32.1011470Z x1 = x[:, D:] 2025-05-07T20:32:32.1011679Z 2025-05-07T20:32:32.1011872Z if contiguous: 2025-05-07T20:32:32.1012118Z x0 = x0.contiguous() 2025-05-07T20:32:32.1012380Z x1 = x1.contiguous() 2025-05-07T20:32:32.1012628Z 2025-05-07T20:32:32.1012827Z if scale_ub is not None: 2025-05-07T20:32:32.1013110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.1013458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.1013774Z ) 2025-05-07T20:32:32.1013968Z else: 2025-05-07T20:32:32.1014191Z scale_ub_tensor = None 2025-05-07T20:32:32.1014450Z 2025-05-07T20:32:32.1014687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1015011Z op = silu_mul_quant 2025-05-07T20:32:32.1015271Z if compiled: 2025-05-07T20:32:32.1015526Z op = torch.compile(op) 2025-05-07T20:32:32.1015827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.1016113Z 2025-05-07T20:32:32.1016315Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.1016605Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.1016901Z 2025-05-07T20:32:32.1017178Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.1017542Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.1017847Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.1018254Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.1018618Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1018938Z 2025-05-07T20:32:32.1019147Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:32.1019348Z 2025-05-07T20:32:32.1019455Z moe/activation_test.py:126: 2025-05-07T20:32:32.1019824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1020169Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.1020512Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.1021323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.1022102Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.1022667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.1023370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.1024076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.1024819Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1025595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:32.1026366Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.1027163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.1027823Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.1028484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.1029015Z fn() 2025-05-07T20:32:32.1029542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.1030143Z self.fn.run( 2025-05-07T20:32:32.1030628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.1031172Z kernel = self.compile( 2025-05-07T20:32:32.1031733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.1032450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1032854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.1033089Z 2025-05-07T20:32:32.1033303Z self = 2025-05-07T20:32:32.1034418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.1035832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa8984d3d00>} 2025-05-07T20:32:32.1037269Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.1038322Z context = 2025-05-07T20:32:32.1038623Z 2025-05-07T20:32:32.1038794Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.1039332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1039816Z module_map=module_map) 2025-05-07T20:32:32.1040189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1040556Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.1040829Z E ^ 2025-05-07T20:32:32.1041301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1041764Z 2025-05-07T20:32:32.1042239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.1042769Z 2025-05-07T20:32:32.1042879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.1043308Z self=, 2025-05-07T20:32:32.1043715Z T=1, 2025-05-07T20:32:32.1043900Z D=5120, 2025-05-07T20:32:32.1044098Z scale_ub=None, 2025-05-07T20:32:32.1044316Z contiguous=True, 2025-05-07T20:32:32.1044551Z compiled=False, 2025-05-07T20:32:32.1044763Z ) 2025-05-07T20:32:32.4260576Z self = 2025-05-07T20:32:32.4261368Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.4268399Z 2025-05-07T20:32:32.4268545Z @given( 2025-05-07T20:32:32.4268878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.4269321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.4269632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.4269964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.4270290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.4270572Z ) 2025-05-07T20:32:32.4271037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.4271482Z def test_silu_mul_quant( 2025-05-07T20:32:32.4271723Z self, 2025-05-07T20:32:32.4271978Z T: int, 2025-05-07T20:32:32.4272165Z D: int, 2025-05-07T20:32:32.4272380Z scale_ub: Optional[float], 2025-05-07T20:32:32.4272656Z contiguous: bool, 2025-05-07T20:32:32.4272898Z compiled: bool, 2025-05-07T20:32:32.4273127Z ) -> None: 2025-05-07T20:32:32.4273344Z torch.manual_seed(2025) 2025-05-07T20:32:32.4273582Z 2025-05-07T20:32:32.4273861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.4274204Z 2025-05-07T20:32:32.4274396Z x_sign = torch.sign(x) 2025-05-07T20:32:32.4274683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.4275070Z x = x_sign * x_clamp 2025-05-07T20:32:32.4275314Z x0 = x[:, :D] 2025-05-07T20:32:32.4275528Z x1 = x[:, D:] 2025-05-07T20:32:32.4275736Z 2025-05-07T20:32:32.4275925Z if contiguous: 2025-05-07T20:32:32.4276154Z x0 = x0.contiguous() 2025-05-07T20:32:32.4276418Z x1 = x1.contiguous() 2025-05-07T20:32:32.4276667Z 2025-05-07T20:32:32.4276861Z if scale_ub is not None: 2025-05-07T20:32:32.4277137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.4277480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.4277783Z ) 2025-05-07T20:32:32.4277984Z else: 2025-05-07T20:32:32.4278193Z scale_ub_tensor = None 2025-05-07T20:32:32.4278445Z 2025-05-07T20:32:32.4278690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.4279004Z op = silu_mul_quant 2025-05-07T20:32:32.4279249Z if compiled: 2025-05-07T20:32:32.4279502Z 
op = torch.compile(op) 2025-05-07T20:32:32.4279802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4280084Z 2025-05-07T20:32:32.4280272Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.4280448Z 2025-05-07T20:32:32.4280548Z moe/activation_test.py:117: 2025-05-07T20:32:32.4280842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4281177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.4281459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4282190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.4282894Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.4283504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.4284192Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.4284874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.4285409Z kernel = self.compile( 2025-05-07T20:32:32.4285954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.4286619Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.4287020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4287269Z 2025-05-07T20:32:32.4287507Z self = 2025-05-07T20:32:32.4288606Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.4290052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8984d1ea0>} 2025-05-07T20:32:32.4291425Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.4292499Z context = 2025-05-07T20:32:32.4292795Z 2025-05-07T20:32:32.4292963Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.4293489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.4293961Z module_map=module_map) 2025-05-07T20:32:32.4294325Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.4294681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.4294938Z E ^ 2025-05-07T20:32:32.4295451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.4295909Z 2025-05-07T20:32:32.4296330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.4296850Z 2025-05-07T20:32:32.4296957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.4297426Z self=, 2025-05-07T20:32:32.4297821Z T=128, 2025-05-07T20:32:32.4298110Z D=5120, 2025-05-07T20:32:32.4298304Z scale_ub=None, 2025-05-07T20:32:32.4298518Z contiguous=False, 2025-05-07T20:32:32.4298748Z compiled=True, 2025-05-07T20:32:32.4298950Z ) 2025-05-07T20:32:32.4299271Z self = 2025-05-07T20:32:32.4299765Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.4300037Z 2025-05-07T20:32:32.4300115Z @given( 2025-05-07T20:32:32.4300347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.4300657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.4300961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.4301292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.4301625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.4301911Z ) 2025-05-07T20:32:32.4302266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.4302705Z def test_silu_mul_quant( 2025-05-07T20:32:32.4302943Z self, 2025-05-07T20:32:32.4303135Z T: int, 2025-05-07T20:32:32.4303323Z D: int, 2025-05-07T20:32:32.4303542Z scale_ub: Optional[float], 2025-05-07T20:32:32.4303864Z contiguous: bool, 2025-05-07T20:32:32.4304104Z compiled: bool, 2025-05-07T20:32:32.4304321Z ) -> None: 2025-05-07T20:32:32.4304537Z torch.manual_seed(2025) 2025-05-07T20:32:32.4304784Z 2025-05-07T20:32:32.4305059Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.4305395Z 2025-05-07T20:32:32.4305588Z x_sign = torch.sign(x) 2025-05-07T20:32:32.4305875Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.4306186Z x = x_sign * x_clamp 2025-05-07T20:32:32.4306425Z x0 = x[:, :D] 2025-05-07T20:32:32.4306637Z x1 = x[:, D:] 2025-05-07T20:32:32.4306846Z 2025-05-07T20:32:32.4307028Z if contiguous: 2025-05-07T20:32:32.4307249Z x0 = x0.contiguous() 2025-05-07T20:32:32.4307507Z x1 = x1.contiguous() 2025-05-07T20:32:32.4307747Z 2025-05-07T20:32:32.4307935Z if scale_ub is not None: 2025-05-07T20:32:32.4308212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.4308543Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.4308849Z ) 2025-05-07T20:32:32.4309037Z else: 2025-05-07T20:32:32.4309246Z scale_ub_tensor = None 2025-05-07T20:32:32.4309493Z 2025-05-07T20:32:32.4309767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.4310083Z op = silu_mul_quant 2025-05-07T20:32:32.4310328Z if compiled: 2025-05-07T20:32:32.4310573Z op = torch.compile(op) 2025-05-07T20:32:32.4310908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4311183Z 2025-05-07T20:32:32.4311371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.4311542Z 2025-05-07T20:32:32.4311642Z moe/activation_test.py:117: 2025-05-07T20:32:32.4311941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4312259Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.4312546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.4313110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.4313720Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.4314379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.4315078Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.4315621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.4316307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.4316974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.4317561Z kernel = self.compile( 2025-05-07T20:32:32.4318114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.4318774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.4319169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.4319401Z 2025-05-07T20:32:32.4319615Z self = 2025-05-07T20:32:32.4320704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.4322093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa8984d0dc0>} 2025-05-07T20:32:32.4323497Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.4324536Z context = 2025-05-07T20:32:32.4324825Z 2025-05-07T20:32:32.4324998Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.4325525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.4325996Z module_map=module_map) 2025-05-07T20:32:32.4326364Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.4326719Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.4326973Z E ^ 2025-05-07T20:32:32.4327498Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.4327950Z 2025-05-07T20:32:32.4328375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.4328894Z 2025-05-07T20:32:32.4329007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.4329423Z self=, 2025-05-07T20:32:32.4329824Z T=128, 2025-05-07T20:32:32.4330016Z D=7168, 2025-05-07T20:32:32.4330252Z scale_ub=1200.0, 2025-05-07T20:32:32.4330475Z contiguous=False, 2025-05-07T20:32:32.4330698Z compiled=False, 2025-05-07T20:32:32.4330893Z ) 2025-05-07T20:32:32.5581634Z self = 2025-05-07T20:32:32.5582430Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.5582813Z 2025-05-07T20:32:32.5582939Z @given( 2025-05-07T20:32:32.5583227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5583543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5583862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5584209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5584541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5584832Z ) 2025-05-07T20:32:32.5585317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5585768Z def test_silu_mul_quant( 2025-05-07T20:32:32.5586018Z self, 2025-05-07T20:32:32.5586213Z T: int, 2025-05-07T20:32:32.5586407Z D: int, 2025-05-07T20:32:32.5586634Z scale_ub: Optional[float], 2025-05-07T20:32:32.5586913Z contiguous: bool, 2025-05-07T20:32:32.5587149Z compiled: bool, 2025-05-07T20:32:32.5587373Z ) -> None: 2025-05-07T20:32:32.5587589Z torch.manual_seed(2025) 2025-05-07T20:32:32.5587831Z 2025-05-07T20:32:32.5588105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5588453Z 2025-05-07T20:32:32.5588648Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5588944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5589263Z x = x_sign * x_clamp 2025-05-07T20:32:32.5589508Z x0 = x[:, :D] 2025-05-07T20:32:32.5589734Z x1 = x[:, D:] 2025-05-07T20:32:32.5589936Z 2025-05-07T20:32:32.5590125Z if contiguous: 2025-05-07T20:32:32.5590363Z x0 = x0.contiguous() 2025-05-07T20:32:32.5590622Z x1 = x1.contiguous() 2025-05-07T20:32:32.5590862Z 2025-05-07T20:32:32.5591056Z if scale_ub is not None: 2025-05-07T20:32:32.5591330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5591673Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5591991Z ) 2025-05-07T20:32:32.5592180Z else: 2025-05-07T20:32:32.5592392Z scale_ub_tensor = None 2025-05-07T20:32:32.5592647Z 2025-05-07T20:32:32.5592878Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5593196Z op = silu_mul_quant 2025-05-07T20:32:32.5593523Z if compiled: 2025-05-07T20:32:32.5593772Z op = torch.compile(op) 2025-05-07T20:32:32.5594073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5594352Z 2025-05-07T20:32:32.5594547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5594713Z 2025-05-07T20:32:32.5594815Z moe/activation_test.py:117: 2025-05-07T20:32:32.5595112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5595442Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5595728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5596434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5597142Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5597685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5598385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5599061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5599606Z kernel = self.compile( 2025-05-07T20:32:32.5600223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5600896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5601294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5601561Z 2025-05-07T20:32:32.5601774Z self = 2025-05-07T20:32:32.5602872Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5604280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806e200>} 2025-05-07T20:32:32.5605701Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5606752Z context = 2025-05-07T20:32:32.5607046Z 2025-05-07T20:32:32.5607220Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5607743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5608216Z module_map=module_map) 2025-05-07T20:32:32.5608585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5608937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5609197Z E ^ 2025-05-07T20:32:32.5609669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5610128Z 2025-05-07T20:32:32.5610559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5611077Z 2025-05-07T20:32:32.5611182Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5611605Z self=, 2025-05-07T20:32:32.5612011Z T=128, 2025-05-07T20:32:32.5612228Z D=5120, 2025-05-07T20:32:32.5612419Z scale_ub=None, 2025-05-07T20:32:32.5612639Z contiguous=False, 2025-05-07T20:32:32.5612869Z compiled=False, 2025-05-07T20:32:32.5613072Z ) 2025-05-07T20:32:32.5613397Z self = 2025-05-07T20:32:32.5613895Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.5614228Z 2025-05-07T20:32:32.5614312Z @given( 2025-05-07T20:32:32.5614538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.5614854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.5615163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.5615491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.5615826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.5616111Z ) 2025-05-07T20:32:32.5616464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.5616910Z def test_silu_mul_quant( 2025-05-07T20:32:32.5617155Z self, 2025-05-07T20:32:32.5617345Z T: int, 2025-05-07T20:32:32.5617542Z D: int, 2025-05-07T20:32:32.5617757Z scale_ub: Optional[float], 2025-05-07T20:32:32.5618109Z contiguous: bool, 2025-05-07T20:32:32.5618368Z compiled: bool, 2025-05-07T20:32:32.5618604Z ) -> None: 2025-05-07T20:32:32.5618831Z torch.manual_seed(2025) 2025-05-07T20:32:32.5619084Z 2025-05-07T20:32:32.5619379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.5619765Z 2025-05-07T20:32:32.5619963Z x_sign = torch.sign(x) 2025-05-07T20:32:32.5620332Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.5620676Z x = x_sign * x_clamp 2025-05-07T20:32:32.5620929Z x0 = x[:, :D] 2025-05-07T20:32:32.5621170Z x1 = x[:, D:] 2025-05-07T20:32:32.5621434Z 2025-05-07T20:32:32.5621625Z if contiguous: 2025-05-07T20:32:32.5621873Z x0 = x0.contiguous() 2025-05-07T20:32:32.5622154Z x1 = x1.contiguous() 2025-05-07T20:32:32.5622408Z 2025-05-07T20:32:32.5622608Z if scale_ub is not None: 2025-05-07T20:32:32.5622906Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.5623277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.5623627Z ) 2025-05-07T20:32:32.5623830Z else: 2025-05-07T20:32:32.5624047Z scale_ub_tensor = None 2025-05-07T20:32:32.5624319Z 2025-05-07T20:32:32.5624601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.5624920Z op = silu_mul_quant 2025-05-07T20:32:32.5625166Z if compiled: 2025-05-07T20:32:32.5625414Z op = torch.compile(op) 2025-05-07T20:32:32.5625715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5625989Z 2025-05-07T20:32:32.5626189Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.5626354Z 2025-05-07T20:32:32.5626460Z moe/activation_test.py:117: 2025-05-07T20:32:32.5626750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5627082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.5627398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.5628126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.5628822Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.5629371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.5630069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.5630736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.5631278Z kernel = self.compile( 2025-05-07T20:32:32.5631831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.5632499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.5632889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.5633121Z 2025-05-07T20:32:32.5633380Z self = 2025-05-07T20:32:32.5634488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.5635891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898626a70>} 2025-05-07T20:32:32.5637262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.5638310Z context = 2025-05-07T20:32:32.5638608Z 2025-05-07T20:32:32.5638776Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.5639309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.5639779Z module_map=module_map) 2025-05-07T20:32:32.5640151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.5640547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.5640808Z E ^ 2025-05-07T20:32:32.5641274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.5641772Z 2025-05-07T20:32:32.5642191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.5642708Z 2025-05-07T20:32:32.5642817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.5643233Z self=, 2025-05-07T20:32:32.5643635Z T=128, 2025-05-07T20:32:32.5643823Z D=5120, 2025-05-07T20:32:32.5644021Z scale_ub=1200.0, 2025-05-07T20:32:32.5644242Z contiguous=True, 2025-05-07T20:32:32.5644467Z compiled=False, 2025-05-07T20:32:32.5644674Z ) 2025-05-07T20:32:32.7570620Z self = 2025-05-07T20:32:32.7571220Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7571495Z 2025-05-07T20:32:32.7571574Z @given( 2025-05-07T20:32:32.7571811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7572135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7572441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7572774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7573105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7573393Z ) 2025-05-07T20:32:32.7573745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7574191Z def test_silu_mul_quant( 2025-05-07T20:32:32.7574440Z self, 2025-05-07T20:32:32.7574634Z T: int, 2025-05-07T20:32:32.7574831Z D: int, 2025-05-07T20:32:32.7575055Z scale_ub: Optional[float], 2025-05-07T20:32:32.7575329Z contiguous: bool, 2025-05-07T20:32:32.7575572Z compiled: bool, 2025-05-07T20:32:32.7575803Z ) -> None: 2025-05-07T20:32:32.7576018Z torch.manual_seed(2025) 2025-05-07T20:32:32.7576261Z 2025-05-07T20:32:32.7576542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7576886Z 2025-05-07T20:32:32.7577082Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7577382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7577688Z x = x_sign * x_clamp 2025-05-07T20:32:32.7577930Z x0 = x[:, :D] 2025-05-07T20:32:32.7578224Z x1 = x[:, D:] 2025-05-07T20:32:32.7578428Z 2025-05-07T20:32:32.7578616Z if contiguous: 2025-05-07T20:32:32.7578851Z x0 = x0.contiguous() 2025-05-07T20:32:32.7579220Z x1 = x1.contiguous() 2025-05-07T20:32:32.7579461Z 2025-05-07T20:32:32.7579657Z if scale_ub is not None: 2025-05-07T20:32:32.7579938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7580277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7580585Z ) 2025-05-07T20:32:32.7580779Z else: 2025-05-07T20:32:32.7580988Z scale_ub_tensor = None 2025-05-07T20:32:32.7581243Z 2025-05-07T20:32:32.7581479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7581791Z op = silu_mul_quant 2025-05-07T20:32:32.7582045Z if compiled: 2025-05-07T20:32:32.7582294Z op = torch.compile(op) 2025-05-07T20:32:32.7582590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7582861Z 2025-05-07T20:32:32.7583057Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7583224Z 2025-05-07T20:32:32.7583332Z moe/activation_test.py:117: 2025-05-07T20:32:32.7583621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7583946Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7584238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7584999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7585703Z 
Hypothesis continues trying examples; each retry below prints the same test source and the same CompilationError [elided; identical to the first example above, except that compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant].

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
Here fn() returns, and the failure moves to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
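Editor's note on the reference path: triton_quantize_fp8_row returns a rowwise-quantized FP8 tensor plus one float32 scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough pure-PyTorch emulation consistent with that contract is sketched below; FP8_E4M3_MAX and the exact scale_ub clamping are assumptions, not FBGEMM's implementation:

    # Rough emulation (assumptions noted) of rowwise FP8 quantization as the
    # test consumes it: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row max
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / FP8_E4M3_MAX    # dequantization multiplier, shape [T]
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale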
2025-05-07T20:32:33.3349894Z op = torch.compile(op) 2025-05-07T20:32:33.3350214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3350494Z 2025-05-07T20:32:33.3350701Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3350871Z 2025-05-07T20:32:33.3350989Z moe/activation_test.py:117: 2025-05-07T20:32:33.3351298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3351647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3351945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3352526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.3353113Z return fn(*args, **kwargs) 2025-05-07T20:32:33.3353800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3354519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3355152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3356163Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3356851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3357502Z kernel = self.compile( 2025-05-07T20:32:33.3358113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3358792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3359203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3359438Z 2025-05-07T20:32:33.3359657Z self = 2025-05-07T20:32:33.3360775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3362277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7adcabeb0>} 2025-05-07T20:32:33.3363664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3364720Z context = 2025-05-07T20:32:33.3365016Z 2025-05-07T20:32:33.3365190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3365728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3366232Z module_map=module_map) 2025-05-07T20:32:33.3374276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3374663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3374930Z E ^ 2025-05-07T20:32:33.3375419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3375885Z 2025-05-07T20:32:33.3376325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3376851Z 2025-05-07T20:32:33.3376972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3377393Z self=, 2025-05-07T20:32:33.3377811Z T=1, 2025-05-07T20:32:33.3378099Z D=5120, 2025-05-07T20:32:33.3378412Z scale_ub=1200.0, 2025-05-07T20:32:33.3378655Z contiguous=False, 2025-05-07T20:32:33.3378895Z compiled=False, 2025-05-07T20:32:33.3379112Z ) 2025-05-07T20:32:33.3379449Z self = 2025-05-07T20:32:33.3379964Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.3380237Z 2025-05-07T20:32:33.3380318Z @given( 2025-05-07T20:32:33.3380570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.3380904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.3381228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.3381568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.3381919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.3382220Z ) 2025-05-07T20:32:33.3382578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.3383039Z def test_silu_mul_quant( 2025-05-07T20:32:33.3383295Z self, 2025-05-07T20:32:33.3383495Z T: int, 2025-05-07T20:32:33.3383705Z D: int, 2025-05-07T20:32:33.3383944Z scale_ub: Optional[float], 2025-05-07T20:32:33.3384222Z contiguous: bool, 2025-05-07T20:32:33.3384552Z compiled: bool, 2025-05-07T20:32:33.3384792Z ) -> None: 2025-05-07T20:32:33.3385017Z torch.manual_seed(2025) 2025-05-07T20:32:33.3385278Z 2025-05-07T20:32:33.3385569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.3385974Z 2025-05-07T20:32:33.3386176Z x_sign = torch.sign(x) 2025-05-07T20:32:33.3386487Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.3386812Z x = x_sign * x_clamp 2025-05-07T20:32:33.3387057Z x0 = x[:, :D] 2025-05-07T20:32:33.3387291Z x1 = x[:, D:] 2025-05-07T20:32:33.3387516Z 2025-05-07T20:32:33.3387708Z if contiguous: 2025-05-07T20:32:33.3387963Z x0 = x0.contiguous() 2025-05-07T20:32:33.3388241Z x1 = x1.contiguous() 2025-05-07T20:32:33.3388489Z 2025-05-07T20:32:33.3388699Z if scale_ub is not None: 2025-05-07T20:32:33.3389038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.3389387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.3389712Z ) 2025-05-07T20:32:33.3389920Z else: 2025-05-07T20:32:33.3390140Z scale_ub_tensor = None 2025-05-07T20:32:33.3390412Z 2025-05-07T20:32:33.3390663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.3390995Z op = silu_mul_quant 2025-05-07T20:32:33.3391251Z if compiled: 2025-05-07T20:32:33.3391519Z op = torch.compile(op) 2025-05-07T20:32:33.3391831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3392108Z 2025-05-07T20:32:33.3392317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.3392487Z 2025-05-07T20:32:33.3392594Z moe/activation_test.py:117: 2025-05-07T20:32:33.3392901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3393247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.3393535Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.3394250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.3394960Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.3395520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.3396216Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.3396902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.3397451Z kernel = self.compile( 2025-05-07T20:32:33.3398105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.3398784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.3399200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.3399433Z 2025-05-07T20:32:33.3399655Z self = 2025-05-07T20:32:33.3400761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.3402180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898624940>} 2025-05-07T20:32:33.3403559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.3404610Z context = 2025-05-07T20:32:33.3404905Z 2025-05-07T20:32:33.3405132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.3405663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.3406150Z module_map=module_map) 2025-05-07T20:32:33.3406570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.3406929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.3407199Z E ^ 2025-05-07T20:32:33.3407680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.3408138Z 2025-05-07T20:32:33.3408574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.3409095Z 2025-05-07T20:32:33.3409202Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.3409677Z self=, 2025-05-07T20:32:33.3410094Z T=16384, 2025-05-07T20:32:33.3410291Z D=5120, 2025-05-07T20:32:33.3410496Z scale_ub=1200.0, 2025-05-07T20:32:33.3410731Z contiguous=False, 2025-05-07T20:32:33.3410959Z compiled=True, 2025-05-07T20:32:33.3411171Z ) 2025-05-07T20:32:33.4403419Z self = 2025-05-07T20:32:33.4403977Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.4404268Z 2025-05-07T20:32:33.4404352Z @given( 2025-05-07T20:32:33.4404600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.4404925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.4405259Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.4405598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.4405942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.4406248Z ) 2025-05-07T20:32:33.4406612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.4407068Z def test_silu_mul_quant( 2025-05-07T20:32:33.4407326Z self, 2025-05-07T20:32:33.4407555Z T: int, 2025-05-07T20:32:33.4407786Z D: int, 2025-05-07T20:32:33.4408045Z scale_ub: Optional[float], 2025-05-07T20:32:33.4408332Z contiguous: bool, 2025-05-07T20:32:33.4408586Z compiled: bool, 2025-05-07T20:32:33.4408830Z ) -> None: 2025-05-07T20:32:33.4409052Z torch.manual_seed(2025) 2025-05-07T20:32:33.4409307Z 2025-05-07T20:32:33.4409624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.4409985Z 2025-05-07T20:32:33.4410196Z x_sign = torch.sign(x) 2025-05-07T20:32:33.4410756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.4411086Z x = x_sign * x_clamp 2025-05-07T20:32:33.4411348Z x0 = x[:, :D] 2025-05-07T20:32:33.4411575Z x1 = x[:, D:] 2025-05-07T20:32:33.4411795Z 2025-05-07T20:32:33.4412001Z if contiguous: 2025-05-07T20:32:33.4412239Z x0 = x0.contiguous() 2025-05-07T20:32:33.4412514Z x1 = x1.contiguous() 2025-05-07T20:32:33.4412769Z 2025-05-07T20:32:33.4412975Z if scale_ub is not None: 2025-05-07T20:32:33.4413258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.4413610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.4413931Z ) 2025-05-07T20:32:33.4414129Z else: 2025-05-07T20:32:33.4414352Z scale_ub_tensor = None 2025-05-07T20:32:33.4414619Z 2025-05-07T20:32:33.4414865Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.4415198Z op = silu_mul_quant 2025-05-07T20:32:33.4415460Z if compiled: 2025-05-07T20:32:33.4415714Z op = torch.compile(op) 2025-05-07T20:32:33.4416030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4416316Z 2025-05-07T20:32:33.4416594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.4416772Z 2025-05-07T20:32:33.4416878Z moe/activation_test.py:117: 2025-05-07T20:32:33.4417184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4417626Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.4417939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4418595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.4419171Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.4419840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.4420547Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.4421102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.4421882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.4422561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.4423112Z kernel = self.compile( 2025-05-07T20:32:33.4423672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.4424344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.4424752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4424991Z 2025-05-07T20:32:33.4425205Z self = 2025-05-07T20:32:33.4426316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.4427741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada448b0>} 2025-05-07T20:32:33.4429110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.4430167Z context = 2025-05-07T20:32:33.4430468Z 2025-05-07T20:32:33.4430640Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.4431223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.4431699Z module_map=module_map) 2025-05-07T20:32:33.4432081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.4432456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.4432719Z E ^ 2025-05-07T20:32:33.4433204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.4433670Z 2025-05-07T20:32:33.4434094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.4434619Z 2025-05-07T20:32:33.4434735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.4435158Z self=, 2025-05-07T20:32:33.4435572Z T=2048, 2025-05-07T20:32:33.4435773Z D=7168, 2025-05-07T20:32:33.4435970Z scale_ub=1200.0, 2025-05-07T20:32:33.4436208Z contiguous=False, 2025-05-07T20:32:33.4436448Z compiled=True, 2025-05-07T20:32:33.4436658Z ) 2025-05-07T20:32:33.4436990Z self = 2025-05-07T20:32:33.4437508Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.4437785Z 2025-05-07T20:32:33.4437922Z @given( 2025-05-07T20:32:33.4438156Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.4438482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.4438802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.4439213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.4439554Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.4439851Z ) 2025-05-07T20:32:33.4440209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.4440668Z def test_silu_mul_quant( 2025-05-07T20:32:33.4440932Z self, 2025-05-07T20:32:33.4441140Z T: int, 2025-05-07T20:32:33.4441346Z D: int, 2025-05-07T20:32:33.4441582Z scale_ub: Optional[float], 2025-05-07T20:32:33.4441867Z contiguous: bool, 2025-05-07T20:32:33.4442162Z compiled: bool, 2025-05-07T20:32:33.4442398Z ) -> None: 2025-05-07T20:32:33.4442633Z torch.manual_seed(2025) 2025-05-07T20:32:33.4442881Z 2025-05-07T20:32:33.4443173Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.4443529Z 2025-05-07T20:32:33.4443737Z x_sign = torch.sign(x) 2025-05-07T20:32:33.4444040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.4444363Z x = x_sign * x_clamp 2025-05-07T20:32:33.4444617Z x0 = x[:, :D] 2025-05-07T20:32:33.4444843Z x1 = x[:, D:] 2025-05-07T20:32:33.4445064Z 2025-05-07T20:32:33.4445265Z if contiguous: 2025-05-07T20:32:33.4445503Z x0 = x0.contiguous() 2025-05-07T20:32:33.4445774Z x1 = x1.contiguous() 2025-05-07T20:32:33.4446026Z 2025-05-07T20:32:33.4446227Z if scale_ub is not None: 2025-05-07T20:32:33.4446514Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.4446866Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.4447176Z ) 2025-05-07T20:32:33.4447390Z else: 2025-05-07T20:32:33.4447652Z scale_ub_tensor = None 2025-05-07T20:32:33.4447914Z 2025-05-07T20:32:33.4448161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.4448493Z op = silu_mul_quant 2025-05-07T20:32:33.4448755Z if compiled: 2025-05-07T20:32:33.4449008Z op = torch.compile(op) 2025-05-07T20:32:33.4449322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4449610Z 2025-05-07T20:32:33.4449807Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.4449987Z 2025-05-07T20:32:33.4450092Z moe/activation_test.py:117: 2025-05-07T20:32:33.4450450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4450783Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.4451079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.4451661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.4452240Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.4452909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.4453620Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.4454172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.4454864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.4455797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.4456355Z kernel = self.compile( 2025-05-07T20:32:33.4456914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.4457587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.4458160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.4458420Z 2025-05-07T20:32:33.4458645Z self = 2025-05-07T20:32:33.4459810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.4461208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada45090>} 2025-05-07T20:32:33.4462584Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.4463696Z context = 2025-05-07T20:32:33.4463994Z 2025-05-07T20:32:33.4464172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.4464701Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.4465188Z module_map=module_map) 2025-05-07T20:32:33.4465568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.4465937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.4466199Z E ^ 2025-05-07T20:32:33.4466680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.4467139Z 2025-05-07T20:32:33.4467576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.4468093Z 2025-05-07T20:32:33.5756320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5756863Z self=, 2025-05-07T20:32:33.5757458Z T=1, 2025-05-07T20:32:33.5757679Z D=5120, 2025-05-07T20:32:33.5757881Z scale_ub=None, 2025-05-07T20:32:33.5758113Z contiguous=False, 2025-05-07T20:32:33.5758351Z compiled=False, 2025-05-07T20:32:33.5758572Z ) 2025-05-07T20:32:33.5758907Z self = 2025-05-07T20:32:33.5759404Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:33.5759680Z 2025-05-07T20:32:33.5759763Z @given( 2025-05-07T20:32:33.5760009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5760327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5760932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5761281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5761622Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5761920Z ) 2025-05-07T20:32:33.5762290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5762750Z def test_silu_mul_quant( 2025-05-07T20:32:33.5762998Z self, 2025-05-07T20:32:33.5763205Z T: int, 2025-05-07T20:32:33.5763417Z D: int, 2025-05-07T20:32:33.5763643Z scale_ub: Optional[float], 2025-05-07T20:32:33.5763931Z contiguous: bool, 2025-05-07T20:32:33.5764185Z compiled: bool, 2025-05-07T20:32:33.5764419Z ) -> None: 2025-05-07T20:32:33.5764647Z torch.manual_seed(2025) 2025-05-07T20:32:33.5764902Z 2025-05-07T20:32:33.5765184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5765543Z 2025-05-07T20:32:33.5765756Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5766053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5766378Z x = x_sign * x_clamp 2025-05-07T20:32:33.5766632Z x0 = x[:, :D] 2025-05-07T20:32:33.5766931Z x1 = x[:, D:] 2025-05-07T20:32:33.5767152Z 2025-05-07T20:32:33.5767352Z if contiguous: 2025-05-07T20:32:33.5767596Z x0 = x0.contiguous() 2025-05-07T20:32:33.5767903Z x1 = x1.contiguous() 2025-05-07T20:32:33.5768243Z 2025-05-07T20:32:33.5768448Z if scale_ub is not None: 2025-05-07T20:32:33.5768728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5769077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5769398Z ) 2025-05-07T20:32:33.5769595Z else: 2025-05-07T20:32:33.5769818Z scale_ub_tensor = None 2025-05-07T20:32:33.5770085Z 2025-05-07T20:32:33.5770328Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5770656Z op = silu_mul_quant 2025-05-07T20:32:33.5770916Z if compiled: 2025-05-07T20:32:33.5771252Z op = torch.compile(op) 2025-05-07T20:32:33.5771564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5771856Z 2025-05-07T20:32:33.5772057Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5772234Z 2025-05-07T20:32:33.5772339Z moe/activation_test.py:117: 2025-05-07T20:32:33.5772643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5772993Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5773278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5773993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.5774706Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.5775255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.5775952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.5776635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.5777184Z kernel = self.compile( 2025-05-07T20:32:33.5777732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.5778550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.5778951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5779179Z 2025-05-07T20:32:33.5779395Z self = 2025-05-07T20:32:33.5780541Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.5781960Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ada457e0>} 2025-05-07T20:32:33.5783334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.5784380Z context = 2025-05-07T20:32:33.5784670Z 2025-05-07T20:32:33.5784843Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.5785378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.5785857Z module_map=module_map) 2025-05-07T20:32:33.5786237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.5786591Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.5786856Z E ^ 2025-05-07T20:32:33.5787343Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5787844Z 2025-05-07T20:32:33.5788268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.5788793Z 2025-05-07T20:32:33.5788943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.5789372Z self=, 2025-05-07T20:32:33.5789781Z T=4096, 2025-05-07T20:32:33.5789975Z D=7168, 2025-05-07T20:32:33.5790181Z scale_ub=1200.0, 2025-05-07T20:32:33.5790416Z contiguous=False, 2025-05-07T20:32:33.5790645Z compiled=False, 2025-05-07T20:32:33.5790863Z ) 2025-05-07T20:32:33.5791197Z self = 2025-05-07T20:32:33.5791697Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.5791984Z 2025-05-07T20:32:33.5792114Z @given( 2025-05-07T20:32:33.5792358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.5792681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.5792990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.5793330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.5793671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.5793955Z ) 2025-05-07T20:32:33.5794314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.5794760Z def test_silu_mul_quant( 2025-05-07T20:32:33.5795001Z self, 2025-05-07T20:32:33.5795201Z T: int, 2025-05-07T20:32:33.5795400Z D: int, 2025-05-07T20:32:33.5795617Z scale_ub: Optional[float], 2025-05-07T20:32:33.5795896Z contiguous: bool, 2025-05-07T20:32:33.5796144Z compiled: bool, 2025-05-07T20:32:33.5796366Z ) -> None: 2025-05-07T20:32:33.5796589Z torch.manual_seed(2025) 2025-05-07T20:32:33.5796843Z 2025-05-07T20:32:33.5797124Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.5797467Z 2025-05-07T20:32:33.5797668Z x_sign = torch.sign(x) 2025-05-07T20:32:33.5798016Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.5798328Z x = x_sign * x_clamp 2025-05-07T20:32:33.5798575Z x0 = x[:, :D] 2025-05-07T20:32:33.5798794Z x1 = x[:, D:] 2025-05-07T20:32:33.5799000Z 2025-05-07T20:32:33.5799194Z if contiguous: 2025-05-07T20:32:33.5799430Z x0 = x0.contiguous() 2025-05-07T20:32:33.5799690Z x1 = x1.contiguous() 2025-05-07T20:32:33.5799935Z 2025-05-07T20:32:33.5800133Z if scale_ub is not None: 2025-05-07T20:32:33.5800462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.5800806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.5801116Z ) 2025-05-07T20:32:33.5801330Z else: 2025-05-07T20:32:33.5809471Z scale_ub_tensor = None 2025-05-07T20:32:33.5809736Z 2025-05-07T20:32:33.5809987Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.5810305Z op = silu_mul_quant 2025-05-07T20:32:33.5810561Z if compiled: 2025-05-07T20:32:33.5810817Z op = torch.compile(op) 2025-05-07T20:32:33.5811116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5811392Z 2025-05-07T20:32:33.5811596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.5811763Z 2025-05-07T20:32:33.5811864Z moe/activation_test.py:117: 2025-05-07T20:32:33.5812167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.5812503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.5812794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.5813491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:33.5814199Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:33.5814826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:33.5815519Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:33.5816238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:33.5816783Z     kernel = self.compile(
2025-05-07T20:32:33.5817337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:33.5818077Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:33.5818484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:33.5818928Z self = <...>
2025-05-07T20:32:33.5820081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:33.5821478Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7fa7ada46200>}
2025-05-07T20:32:33.5822849Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:33.5823896Z context = <...>
2025-05-07T20:32:33.5824371Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:33.5824897Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:33.5825381Z                            module_map=module_map)
2025-05-07T20:32:33.5825758Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.5826121Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.5826378Z E   ^
2025-05-07T20:32:33.5826853Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.5827792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.5828431Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:33.5828851Z     self=<...>,
2025-05-07T20:32:33.5829309Z     T=16384,
2025-05-07T20:32:33.5829513Z     D=7168,
2025-05-07T20:32:33.5829705Z     scale_ub=None,
2025-05-07T20:32:33.5829932Z     contiguous=True,
2025-05-07T20:32:33.5830180Z     compiled=True,
2025-05-07T20:32:33.5830384Z )
2025-05-07T20:32:33.7769210Z self = <...>
2025-05-07T20:32:33.7769875Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:33.7770418Z     @given(
2025-05-07T20:32:33.7770726Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:33.7771148Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:33.7771482Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:33.7771834Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:33.7772179Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:33.7772472Z     )
2025-05-07T20:32:33.7772847Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:33.7773306Z     def test_silu_mul_quant(
2025-05-07T20:32:33.7773565Z         self,
2025-05-07T20:32:33.7773804Z         T: int,
2025-05-07T20:32:33.7774023Z         D: int,
2025-05-07T20:32:33.7774433Z         scale_ub: Optional[float],
2025-05-07T20:32:33.7774722Z         contiguous: bool,
2025-05-07T20:32:33.7774971Z         compiled: bool,
2025-05-07T20:32:33.7775208Z     ) -> None:
2025-05-07T20:32:33.7775436Z         torch.manual_seed(2025)
2025-05-07T20:32:33.7776052Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:33.7776604Z         x_sign = torch.sign(x)
2025-05-07T20:32:33.7776910Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:33.7777230Z         x = x_sign * x_clamp
2025-05-07T20:32:33.7777474Z         x0 = x[:, :D]
2025-05-07T20:32:33.7777703Z         x1 = x[:, D:]
2025-05-07T20:32:33.7778258Z         if contiguous:
2025-05-07T20:32:33.7778503Z             x0 = x0.contiguous()
2025-05-07T20:32:33.7778777Z             x1 = x1.contiguous()
2025-05-07T20:32:33.7779320Z         if scale_ub is not None:
2025-05-07T20:32:33.7779617Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:33.7779968Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:33.7780287Z             )
2025-05-07T20:32:33.7780494Z         else:
2025-05-07T20:32:33.7780723Z             scale_ub_tensor = None
2025-05-07T20:32:33.7781237Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:33.7781564Z             op = silu_mul_quant
2025-05-07T20:32:33.7781821Z             if compiled:
2025-05-07T20:32:33.7782081Z                 op = torch.compile(op)
2025-05-07T20:32:33.7782395Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.7782883Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:33.7783165Z moe/activation_test.py:117:
2025-05-07T20:32:33.7783464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:33.7783807Z moe/activation_test.py:115: in fn
2025-05-07T20:32:33.7784105Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.7784684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:33.7785260Z     return fn(*args, **kwargs)
2025-05-07T20:32:33.7785943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:33.7786653Z     _fbgemm_silu_mul_quant[grid](
    ... (Triton frames identical to the trace above) ...
2025-05-07T20:32:33.7798271Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.7798631Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.7798904Z E   ^
2025-05-07T20:32:33.7799385Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.7800318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
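Analysis: Triton raises this ValueError while lowering _fbgemm_silu_mul_quant to TTIR because fp8e4nv (the FP8 E4M3 format that torch.float8_e4m3fn maps to) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer; the A10G in this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard such a suite could use follows; the helper name supports_fp8e4nv and the skip wiring are illustrative assumptions, not part of moe/activation_test.py.

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True if Triton's fp8e4nv (FP8 E4M3) dtype can compile on this GPU."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv needs NVIDIA compute capability 8.9+ (Ada/Hopper); the A10G
        # on a g5.4xlarge is SM 8.6 and only offers fp8e4b15 / fp8e5.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here, unchanged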
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7799890Z 2025-05-07T20:32:33.7800318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.7800845Z 2025-05-07T20:32:33.7800956Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.7801388Z self=, 2025-05-07T20:32:33.7801811Z T=4096, 2025-05-07T20:32:33.7802010Z D=5120, 2025-05-07T20:32:33.7802217Z scale_ub=None, 2025-05-07T20:32:33.7802472Z contiguous=False, 2025-05-07T20:32:33.7802712Z compiled=True, 2025-05-07T20:32:33.7802930Z ) 2025-05-07T20:32:33.7803265Z self = 2025-05-07T20:32:33.7803775Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.7804055Z 2025-05-07T20:32:33.7804138Z @given( 2025-05-07T20:32:33.7804384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.7804716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.7805040Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.7805384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.7805728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.7806032Z ) 2025-05-07T20:32:33.7806395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.7806851Z def test_silu_mul_quant( 2025-05-07T20:32:33.7807102Z self, 2025-05-07T20:32:33.7807300Z T: int, 2025-05-07T20:32:33.7807506Z D: int, 2025-05-07T20:32:33.7807734Z scale_ub: Optional[float], 2025-05-07T20:32:33.7808011Z contiguous: bool, 2025-05-07T20:32:33.7808263Z compiled: bool, 2025-05-07T20:32:33.7808495Z ) -> None: 2025-05-07T20:32:33.7808771Z torch.manual_seed(2025) 2025-05-07T20:32:33.7809028Z 2025-05-07T20:32:33.7809315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.7809667Z 2025-05-07T20:32:33.7809866Z x_sign = torch.sign(x) 2025-05-07T20:32:33.7810171Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.7810490Z x = x_sign * x_clamp 2025-05-07T20:32:33.7810734Z x0 = x[:, :D] 2025-05-07T20:32:33.7810964Z x1 = x[:, D:] 2025-05-07T20:32:33.7811188Z 2025-05-07T20:32:33.7811382Z if contiguous: 2025-05-07T20:32:33.7811626Z x0 = x0.contiguous() 2025-05-07T20:32:33.7811894Z x1 = x1.contiguous() 2025-05-07T20:32:33.7812139Z 2025-05-07T20:32:33.7812345Z if scale_ub is not None: 2025-05-07T20:32:33.7812633Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.7812976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.7813304Z ) 2025-05-07T20:32:33.7813512Z else: 2025-05-07T20:32:33.7813729Z scale_ub_tensor = None 2025-05-07T20:32:33.7813991Z 2025-05-07T20:32:33.7814243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.7814564Z op = silu_mul_quant 2025-05-07T20:32:33.7814878Z if compiled: 2025-05-07T20:32:33.7815141Z op = torch.compile(op) 2025-05-07T20:32:33.7815454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.7815732Z 2025-05-07T20:32:33.7815981Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.7816152Z 2025-05-07T20:32:33.7816263Z moe/activation_test.py:117: 2025-05-07T20:32:33.7816565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.7816922Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.7817212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.7817794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.7818481Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.7819158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.7819953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.7820509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.7821209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.7821889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.7822441Z kernel = self.compile( 2025-05-07T20:32:33.7823004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.7823685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.7824092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.7824329Z 2025-05-07T20:32:33.7824549Z self = 2025-05-07T20:32:33.7825663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.7827075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768280>} 2025-05-07T20:32:33.7828451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.7829553Z context = 2025-05-07T20:32:33.7829857Z 2025-05-07T20:32:33.7830031Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.7830573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.7831054Z module_map=module_map) 2025-05-07T20:32:33.7831433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.7831804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.7832072Z E ^ 2025-05-07T20:32:33.7832551Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.7833018Z 2025-05-07T20:32:33.7833444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.7833966Z 2025-05-07T20:32:34.1088923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1089470Z self=, 2025-05-07T20:32:34.1089891Z T=4096, 2025-05-07T20:32:34.1090095Z D=5120, 2025-05-07T20:32:34.1090311Z scale_ub=1200.0, 2025-05-07T20:32:34.1090557Z contiguous=False, 2025-05-07T20:32:34.1090795Z compiled=False, 2025-05-07T20:32:34.1091304Z ) 2025-05-07T20:32:34.1091647Z self = 2025-05-07T20:32:34.1092166Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.1092528Z 2025-05-07T20:32:34.1092613Z @given( 2025-05-07T20:32:34.1092858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1093184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1093501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1093850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1094200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1094499Z ) 2025-05-07T20:32:34.1094862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1095314Z def test_silu_mul_quant( 2025-05-07T20:32:34.1095658Z self, 2025-05-07T20:32:34.1095858Z T: int, 2025-05-07T20:32:34.1096070Z D: int, 2025-05-07T20:32:34.1096299Z scale_ub: Optional[float], 2025-05-07T20:32:34.1096577Z contiguous: bool, 2025-05-07T20:32:34.1096829Z compiled: bool, 2025-05-07T20:32:34.1097069Z ) -> None: 2025-05-07T20:32:34.1097296Z torch.manual_seed(2025) 2025-05-07T20:32:34.1097574Z 2025-05-07T20:32:34.1097887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1098301Z 2025-05-07T20:32:34.1098506Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1098809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1099122Z x = x_sign * x_clamp 2025-05-07T20:32:34.1099373Z x0 = x[:, :D] 2025-05-07T20:32:34.1099605Z x1 = x[:, D:] 2025-05-07T20:32:34.1099820Z 2025-05-07T20:32:34.1100028Z if contiguous: 2025-05-07T20:32:34.1100284Z x0 = x0.contiguous() 2025-05-07T20:32:34.1100550Z x1 = x1.contiguous() 2025-05-07T20:32:34.1100801Z 2025-05-07T20:32:34.1101005Z if scale_ub is not None: 2025-05-07T20:32:34.1101283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1101628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1101947Z ) 2025-05-07T20:32:34.1102147Z else: 2025-05-07T20:32:34.1102364Z scale_ub_tensor = None 2025-05-07T20:32:34.1102626Z 2025-05-07T20:32:34.1102870Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1103188Z op = silu_mul_quant 2025-05-07T20:32:34.1103449Z if compiled: 2025-05-07T20:32:34.1103705Z op = torch.compile(op) 2025-05-07T20:32:34.1104099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1104388Z 2025-05-07T20:32:34.1104594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1104761Z 2025-05-07T20:32:34.1104868Z moe/activation_test.py:117: 2025-05-07T20:32:34.1105172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1105510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1105801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1106502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.1107212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.1107817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.1108508Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.1109189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.1109735Z kernel = self.compile( 2025-05-07T20:32:34.1110291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.1111028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.1111434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1111663Z 2025-05-07T20:32:34.1111885Z self = 2025-05-07T20:32:34.1113035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.1114456Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769000>} 2025-05-07T20:32:34.1115834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.1116935Z context = 2025-05-07T20:32:34.1117232Z 2025-05-07T20:32:34.1117413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.1117950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.1118489Z module_map=module_map) 2025-05-07T20:32:34.1118873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.1119239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.1119505Z E ^ 2025-05-07T20:32:34.1119985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.1120442Z 2025-05-07T20:32:34.1120875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.1121402Z 2025-05-07T20:32:34.1121528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1121954Z self=, 2025-05-07T20:32:34.1122373Z T=4096, 2025-05-07T20:32:34.1122577Z D=5120, 2025-05-07T20:32:34.1122784Z scale_ub=1200.0, 2025-05-07T20:32:34.1123024Z contiguous=False, 2025-05-07T20:32:34.1123264Z compiled=True, 2025-05-07T20:32:34.1123474Z ) 2025-05-07T20:32:34.1123806Z self = 2025-05-07T20:32:34.1124312Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.1124591Z 2025-05-07T20:32:34.1124671Z @given( 2025-05-07T20:32:34.1124960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1125286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1125602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1125944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1126284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1126585Z ) 2025-05-07T20:32:34.1126945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1127397Z def test_silu_mul_quant( 2025-05-07T20:32:34.1127655Z self, 2025-05-07T20:32:34.1127855Z T: int, 2025-05-07T20:32:34.1128062Z D: int, 2025-05-07T20:32:34.1128293Z scale_ub: Optional[float], 2025-05-07T20:32:34.1128568Z contiguous: bool, 2025-05-07T20:32:34.1128819Z compiled: bool, 2025-05-07T20:32:34.1129051Z ) -> None: 2025-05-07T20:32:34.1129272Z torch.manual_seed(2025) 2025-05-07T20:32:34.1129526Z 2025-05-07T20:32:34.1129818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1130165Z 2025-05-07T20:32:34.1130371Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1130676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1130995Z x = x_sign * x_clamp 2025-05-07T20:32:34.1131290Z x0 = x[:, :D] 2025-05-07T20:32:34.1131521Z x1 = x[:, D:] 2025-05-07T20:32:34.1131741Z 2025-05-07T20:32:34.1131932Z if contiguous: 2025-05-07T20:32:34.1132175Z x0 = x0.contiguous() 2025-05-07T20:32:34.1132485Z x1 = x1.contiguous() 2025-05-07T20:32:34.1132731Z 2025-05-07T20:32:34.1132938Z if scale_ub is not None: 2025-05-07T20:32:34.1133229Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.1133572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.1133892Z ) 2025-05-07T20:32:34.1134097Z else: 2025-05-07T20:32:34.1134314Z scale_ub_tensor = None 2025-05-07T20:32:34.1134584Z 2025-05-07T20:32:34.1134830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.1135149Z op = silu_mul_quant 2025-05-07T20:32:34.1135459Z if compiled: 2025-05-07T20:32:34.1135719Z op = torch.compile(op) 2025-05-07T20:32:34.1136037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1136317Z 2025-05-07T20:32:34.1136520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.1136688Z 2025-05-07T20:32:34.1136796Z moe/activation_test.py:117: 2025-05-07T20:32:34.1137094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1137432Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.1137724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.1138401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.1138973Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.1139652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.1140366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.1140912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.1141617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.1142293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.1142837Z kernel = self.compile( 2025-05-07T20:32:34.1143395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.1144069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.1144477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.1144757Z 2025-05-07T20:32:34.1144973Z self = 2025-05-07T20:32:34.1146076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.1147479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768700>} 2025-05-07T20:32:34.1148905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.1149953Z context = 2025-05-07T20:32:34.1150251Z 2025-05-07T20:32:34.1150430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.1150972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.1151459Z module_map=module_map) 2025-05-07T20:32:34.1151878Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.1152244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.1152518Z E ^ 2025-05-07T20:32:34.1152994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.1153542Z 2025-05-07T20:32:34.1162320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.1162899Z 2025-05-07T20:32:34.2430622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2431154Z self=, 2025-05-07T20:32:34.2431725Z T=2048, 2025-05-07T20:32:34.2431951Z D=7168, 2025-05-07T20:32:34.2432159Z scale_ub=1200.0, 2025-05-07T20:32:34.2432397Z contiguous=False, 2025-05-07T20:32:34.2432640Z compiled=False, 2025-05-07T20:32:34.2433181Z ) 2025-05-07T20:32:34.2433522Z self = 2025-05-07T20:32:34.2434053Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.2434354Z 2025-05-07T20:32:34.2434438Z @given( 2025-05-07T20:32:34.2434688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2435022Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2435351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2435708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2436051Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2436355Z ) 2025-05-07T20:32:34.2436728Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2437191Z def test_silu_mul_quant( 2025-05-07T20:32:34.2437442Z self, 2025-05-07T20:32:34.2437653Z T: int, 2025-05-07T20:32:34.2437901Z D: int, 2025-05-07T20:32:34.2438140Z scale_ub: Optional[float], 2025-05-07T20:32:34.2438431Z contiguous: bool, 2025-05-07T20:32:34.2438685Z compiled: bool, 2025-05-07T20:32:34.2438924Z ) -> None: 2025-05-07T20:32:34.2439153Z torch.manual_seed(2025) 2025-05-07T20:32:34.2439412Z 2025-05-07T20:32:34.2439701Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2440069Z 2025-05-07T20:32:34.2440276Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2440579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2440904Z x = x_sign * x_clamp 2025-05-07T20:32:34.2441161Z x0 = x[:, :D] 2025-05-07T20:32:34.2441382Z x1 = x[:, D:] 2025-05-07T20:32:34.2441604Z 2025-05-07T20:32:34.2441901Z if contiguous: 2025-05-07T20:32:34.2442149Z x0 = x0.contiguous() 2025-05-07T20:32:34.2442431Z x1 = x1.contiguous() 2025-05-07T20:32:34.2442682Z 2025-05-07T20:32:34.2442883Z if scale_ub is not None: 2025-05-07T20:32:34.2443183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2443539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2443875Z ) 2025-05-07T20:32:34.2444073Z else: 2025-05-07T20:32:34.2444304Z scale_ub_tensor = None 2025-05-07T20:32:34.2444577Z 2025-05-07T20:32:34.2444819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2445154Z op = silu_mul_quant 2025-05-07T20:32:34.2445424Z if compiled: 2025-05-07T20:32:34.2445713Z op = torch.compile(op) 2025-05-07T20:32:34.2446053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2446358Z 2025-05-07T20:32:34.2446566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2446756Z 2025-05-07T20:32:34.2446871Z moe/activation_test.py:117: 2025-05-07T20:32:34.2447201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2447614Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2448035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2448756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.2449484Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2450126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2450836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2451524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2452079Z kernel = self.compile( 2025-05-07T20:32:34.2452647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2453338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2453826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2454064Z 2025-05-07T20:32:34.2454282Z self = 2025-05-07T20:32:34.2455412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2457154Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769240>} 2025-05-07T20:32:34.2458641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2459732Z context = 2025-05-07T20:32:34.2460039Z 2025-05-07T20:32:34.2460215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2460758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2461244Z module_map=module_map) 2025-05-07T20:32:34.2461626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2461998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2462266Z E ^ 2025-05-07T20:32:34.2462752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2463227Z 2025-05-07T20:32:34.2463772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2464306Z 2025-05-07T20:32:34.2464425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2464856Z self=, 2025-05-07T20:32:34.2465280Z T=1, 2025-05-07T20:32:34.2465478Z D=7168, 2025-05-07T20:32:34.2465682Z scale_ub=None, 2025-05-07T20:32:34.2465902Z contiguous=True, 2025-05-07T20:32:34.2466140Z compiled=False, 2025-05-07T20:32:34.2466358Z ) 2025-05-07T20:32:34.2466685Z self = 2025-05-07T20:32:34.2467182Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.2467449Z 2025-05-07T20:32:34.2467536Z @given( 2025-05-07T20:32:34.2467772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2468100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2468451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2468811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2469154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2469460Z ) 2025-05-07T20:32:34.2469892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2470343Z def test_silu_mul_quant( 2025-05-07T20:32:34.2470594Z self, 2025-05-07T20:32:34.2470800Z T: int, 2025-05-07T20:32:34.2471000Z D: int, 2025-05-07T20:32:34.2471287Z scale_ub: Optional[float], 2025-05-07T20:32:34.2471569Z contiguous: bool, 2025-05-07T20:32:34.2471813Z compiled: bool, 2025-05-07T20:32:34.2472051Z ) -> None: 2025-05-07T20:32:34.2472278Z torch.manual_seed(2025) 2025-05-07T20:32:34.2472523Z 2025-05-07T20:32:34.2472808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2473163Z 2025-05-07T20:32:34.2473364Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2473669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2473989Z x = x_sign * x_clamp 2025-05-07T20:32:34.2474303Z x0 = x[:, :D] 2025-05-07T20:32:34.2474531Z x1 = x[:, D:] 2025-05-07T20:32:34.2474750Z 2025-05-07T20:32:34.2474944Z if contiguous: 2025-05-07T20:32:34.2475186Z x0 = x0.contiguous() 2025-05-07T20:32:34.2475457Z x1 = x1.contiguous() 2025-05-07T20:32:34.2475708Z 2025-05-07T20:32:34.2475911Z if scale_ub is not None: 2025-05-07T20:32:34.2476201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2476551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2476863Z ) 2025-05-07T20:32:34.2477067Z else: 2025-05-07T20:32:34.2477290Z scale_ub_tensor = None 2025-05-07T20:32:34.2477551Z 2025-05-07T20:32:34.2477796Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2478132Z op = silu_mul_quant 2025-05-07T20:32:34.2478389Z if compiled: 2025-05-07T20:32:34.2478650Z op = torch.compile(op) 2025-05-07T20:32:34.2478962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2479242Z 2025-05-07T20:32:34.2479447Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2479616Z 2025-05-07T20:32:34.2479725Z moe/activation_test.py:117: 2025-05-07T20:32:34.2480029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2480366Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2480661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2481367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.2482077Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2482682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2483383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2484064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2484610Z kernel = self.compile( 2025-05-07T20:32:34.2485170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2485845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2486251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2486486Z 2025-05-07T20:32:34.2486700Z self = 2025-05-07T20:32:34.2487856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2489272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76a050>} 2025-05-07T20:32:34.2490699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2491785Z context = 2025-05-07T20:32:34.2492086Z 2025-05-07T20:32:34.2492259Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2492795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2493279Z module_map=module_map) 2025-05-07T20:32:34.2493649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2494017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2494288Z E ^ 2025-05-07T20:32:34.2494762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2495275Z 2025-05-07T20:32:34.2495703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2496234Z 2025-05-07T20:32:34.2496345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2496780Z self=, 2025-05-07T20:32:34.2497189Z T=16384, 2025-05-07T20:32:34.2497401Z D=7168, 2025-05-07T20:32:34.2497615Z scale_ub=1200.0, 2025-05-07T20:32:34.2497849Z contiguous=False, 2025-05-07T20:32:34.2498190Z compiled=True, 2025-05-07T20:32:34.5117568Z ) 2025-05-07T20:32:34.5118144Z self = 2025-05-07T20:32:34.5118843Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:34.5119130Z 2025-05-07T20:32:34.5119221Z @given( 2025-05-07T20:32:34.5119466Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5119798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5120124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5120458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5120804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5121108Z ) 2025-05-07T20:32:34.5121473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5121923Z def test_silu_mul_quant( 2025-05-07T20:32:34.5122173Z self, 2025-05-07T20:32:34.5122378Z T: int, 2025-05-07T20:32:34.5122578Z D: int, 2025-05-07T20:32:34.5122806Z scale_ub: Optional[float], 2025-05-07T20:32:34.5123092Z contiguous: bool, 2025-05-07T20:32:34.5123568Z compiled: bool, 2025-05-07T20:32:34.5123809Z ) -> None: 2025-05-07T20:32:34.5124033Z torch.manual_seed(2025) 2025-05-07T20:32:34.5124279Z 2025-05-07T20:32:34.5124564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5124922Z 2025-05-07T20:32:34.5125118Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5125417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5125733Z x = x_sign * x_clamp 2025-05-07T20:32:34.5125977Z x0 = x[:, :D] 2025-05-07T20:32:34.5126204Z x1 = x[:, D:] 2025-05-07T20:32:34.5126417Z 2025-05-07T20:32:34.5126611Z if contiguous: 2025-05-07T20:32:34.5126847Z x0 = x0.contiguous() 2025-05-07T20:32:34.5127120Z x1 = x1.contiguous() 2025-05-07T20:32:34.5127368Z 2025-05-07T20:32:34.5127564Z if scale_ub is not None: 2025-05-07T20:32:34.5127851Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5128229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5128572Z ) 2025-05-07T20:32:34.5128777Z else: 2025-05-07T20:32:34.5128998Z scale_ub_tensor = None 2025-05-07T20:32:34.5129252Z 2025-05-07T20:32:34.5129581Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5129905Z op = silu_mul_quant 2025-05-07T20:32:34.5130157Z if compiled: 2025-05-07T20:32:34.5130416Z op = torch.compile(op) 2025-05-07T20:32:34.5130804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5131089Z 2025-05-07T20:32:34.5131294Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.5131469Z 2025-05-07T20:32:34.5131573Z moe/activation_test.py:117: 2025-05-07T20:32:34.5131876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5132209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.5132501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5133086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.5133661Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.5134434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.5135146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.5135702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5136401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5137095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5137644Z kernel = self.compile( 2025-05-07T20:32:34.5138343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5139049Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5139458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5139691Z 2025-05-07T20:32:34.5139915Z self = 2025-05-07T20:32:34.5141021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5142451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b490>} 2025-05-07T20:32:34.5143888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5144944Z context = 2025-05-07T20:32:34.5145240Z 2025-05-07T20:32:34.5145421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5145957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5146440Z module_map=module_map) 2025-05-07T20:32:34.5146814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5147178Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.5147447Z E ^ 2025-05-07T20:32:34.5147928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5148387Z 2025-05-07T20:32:34.5148817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5149340Z 2025-05-07T20:32:34.5149454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5149885Z self=, 2025-05-07T20:32:34.5150300Z T=1, 2025-05-07T20:32:34.5150492Z D=7168, 2025-05-07T20:32:34.5150698Z scale_ub=None, 2025-05-07T20:32:34.5150974Z contiguous=False, 2025-05-07T20:32:34.5151210Z compiled=False, 2025-05-07T20:32:34.5151431Z ) 2025-05-07T20:32:34.5151764Z self = 2025-05-07T20:32:34.5152308Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.5152575Z 2025-05-07T20:32:34.5152657Z @given( 2025-05-07T20:32:34.5152901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.5153225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.5153538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.5153882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.5154228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.5154519Z ) 2025-05-07T20:32:34.5154880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.5155415Z def test_silu_mul_quant( 2025-05-07T20:32:34.5155985Z self, 2025-05-07T20:32:34.5156185Z T: int, 2025-05-07T20:32:34.5156394Z D: int, 2025-05-07T20:32:34.5156623Z scale_ub: Optional[float], 2025-05-07T20:32:34.5156907Z contiguous: bool, 2025-05-07T20:32:34.5157161Z compiled: bool, 2025-05-07T20:32:34.5157395Z ) -> None: 2025-05-07T20:32:34.5157616Z torch.manual_seed(2025) 2025-05-07T20:32:34.5157873Z 2025-05-07T20:32:34.5158202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5158558Z 2025-05-07T20:32:34.5158763Z x_sign = torch.sign(x) 2025-05-07T20:32:34.5159066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.5159382Z x = x_sign * x_clamp 2025-05-07T20:32:34.5159634Z x0 = x[:, :D] 2025-05-07T20:32:34.5159860Z x1 = x[:, D:] 2025-05-07T20:32:34.5160074Z 2025-05-07T20:32:34.5160276Z if contiguous: 2025-05-07T20:32:34.5160518Z x0 = x0.contiguous() 2025-05-07T20:32:34.5160784Z x1 = x1.contiguous() 2025-05-07T20:32:34.5161037Z 2025-05-07T20:32:34.5161240Z if scale_ub is not None: 2025-05-07T20:32:34.5161524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.5161867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.5162185Z ) 2025-05-07T20:32:34.5162386Z else: 2025-05-07T20:32:34.5162602Z scale_ub_tensor = None 2025-05-07T20:32:34.5162863Z 2025-05-07T20:32:34.5163106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.5163424Z op = silu_mul_quant 2025-05-07T20:32:34.5163683Z if compiled: 2025-05-07T20:32:34.5164020Z op = torch.compile(op) 2025-05-07T20:32:34.5164326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5164609Z 2025-05-07T20:32:34.5164815Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.5164985Z 2025-05-07T20:32:34.5165089Z moe/activation_test.py:117: 2025-05-07T20:32:34.5165398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5165737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.5166042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.5166750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.5167459Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.5168016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.5168784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.5169461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.5170012Z kernel = self.compile( 2025-05-07T20:32:34.5170647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.5171322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.5171729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.5172018Z 2025-05-07T20:32:34.5172233Z self = 2025-05-07T20:32:34.5173342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.5174752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b7f0>} 2025-05-07T20:32:34.5176132Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.5177249Z context = 2025-05-07T20:32:34.5177545Z 2025-05-07T20:32:34.5177725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.5178327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.5178810Z module_map=module_map) 2025-05-07T20:32:34.5179187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.5179558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.5179822Z E ^ 2025-05-07T20:32:34.5180304Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.5180775Z 2025-05-07T20:32:34.5181207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.5181732Z 2025-05-07T20:32:34.5181849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5182272Z self=, 2025-05-07T20:32:34.5182692Z T=2048, 2025-05-07T20:32:34.5182892Z D=7168, 2025-05-07T20:32:34.5183092Z scale_ub=None, 2025-05-07T20:32:34.5183320Z contiguous=False, 2025-05-07T20:32:34.5183556Z compiled=True, 2025-05-07T20:32:34.5183766Z ) 2025-05-07T20:32:34.6183350Z self = 2025-05-07T20:32:34.6183899Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6184217Z 2025-05-07T20:32:34.6184591Z @given( 2025-05-07T20:32:34.6184837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6185162Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6185478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6185825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6186165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6186451Z ) 2025-05-07T20:32:34.6186812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6187272Z def test_silu_mul_quant( 2025-05-07T20:32:34.6187517Z self, 2025-05-07T20:32:34.6187724Z T: int, 2025-05-07T20:32:34.6187951Z D: int, 2025-05-07T20:32:34.6188200Z scale_ub: Optional[float], 2025-05-07T20:32:34.6188484Z contiguous: bool, 2025-05-07T20:32:34.6188731Z compiled: bool, 2025-05-07T20:32:34.6188970Z ) -> None: 2025-05-07T20:32:34.6189194Z torch.manual_seed(2025) 2025-05-07T20:32:34.6189450Z 2025-05-07T20:32:34.6189734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6190080Z 2025-05-07T20:32:34.6190286Z x_sign = torch.sign(x) 2025-05-07T20:32:34.6190668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.6190986Z x = x_sign * x_clamp 2025-05-07T20:32:34.6191236Z x0 = x[:, :D] 2025-05-07T20:32:34.6191483Z x1 = x[:, D:] 2025-05-07T20:32:34.6191703Z 2025-05-07T20:32:34.6191983Z if contiguous: 2025-05-07T20:32:34.6192228Z x0 = x0.contiguous() 2025-05-07T20:32:34.6200768Z x1 = x1.contiguous() 2025-05-07T20:32:34.6201047Z 2025-05-07T20:32:34.6201254Z if scale_ub is not None: 2025-05-07T20:32:34.6201540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.6201889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.6202194Z ) 2025-05-07T20:32:34.6202400Z else: 2025-05-07T20:32:34.6202617Z scale_ub_tensor = None 2025-05-07T20:32:34.6202871Z 2025-05-07T20:32:34.6203115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.6203568Z op = silu_mul_quant 2025-05-07T20:32:34.6203820Z if compiled: 2025-05-07T20:32:34.6204082Z op = torch.compile(op) 2025-05-07T20:32:34.6204391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6204662Z 2025-05-07T20:32:34.6204864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.6205034Z 2025-05-07T20:32:34.6205143Z moe/activation_test.py:117: 2025-05-07T20:32:34.6205440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6205775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.6206066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6206640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.6207208Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.6207879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.6208637Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.6209181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.6209871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.6210551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.6211097Z kernel = self.compile( 2025-05-07T20:32:34.6211646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.6212315Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.6212774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6213007Z 2025-05-07T20:32:34.6213226Z self = 2025-05-07T20:32:34.6214334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.6215761Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24caf0>} 2025-05-07T20:32:34.6217142Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.6218269Z context = 2025-05-07T20:32:34.6218615Z 2025-05-07T20:32:34.6218796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.6219329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.6219862Z module_map=module_map) 2025-05-07T20:32:34.6220238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.6220593Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.6220861Z E ^ 2025-05-07T20:32:34.6221396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.6221854Z 2025-05-07T20:32:34.6222288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.6222806Z 2025-05-07T20:32:34.6222912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6223340Z self=, 2025-05-07T20:32:34.6223751Z T=4096, 2025-05-07T20:32:34.6223938Z D=7168, 2025-05-07T20:32:34.6224139Z scale_ub=None, 2025-05-07T20:32:34.6224413Z contiguous=False, 2025-05-07T20:32:34.6224637Z compiled=True, 2025-05-07T20:32:34.6224850Z ) 2025-05-07T20:32:34.6225176Z self = 2025-05-07T20:32:34.6225669Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6225946Z 2025-05-07T20:32:34.6226025Z @given( 2025-05-07T20:32:34.6226259Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6226576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6226879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6227211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6227542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6227824Z ) 2025-05-07T20:32:34.6228183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6228629Z def test_silu_mul_quant( 2025-05-07T20:32:34.6228865Z self, 2025-05-07T20:32:34.6229065Z T: int, 2025-05-07T20:32:34.6229263Z D: int, 2025-05-07T20:32:34.6229482Z scale_ub: Optional[float], 2025-05-07T20:32:34.6229754Z contiguous: bool, 2025-05-07T20:32:34.6229995Z compiled: bool, 2025-05-07T20:32:34.6230216Z ) -> None: 2025-05-07T20:32:34.6230440Z torch.manual_seed(2025) 2025-05-07T20:32:34.6230686Z 2025-05-07T20:32:34.6230957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6231306Z 2025-05-07T20:32:34.6231503Z x_sign = torch.sign(x) 2025-05-07T20:32:34.6231801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.6232109Z x = x_sign * x_clamp 2025-05-07T20:32:34.6232352Z x0 = x[:, :D] 2025-05-07T20:32:34.6232573Z x1 = x[:, D:] 2025-05-07T20:32:34.6232833Z 2025-05-07T20:32:34.6233027Z if contiguous: 2025-05-07T20:32:34.6233263Z x0 = x0.contiguous() 2025-05-07T20:32:34.6233524Z x1 = x1.contiguous() 2025-05-07T20:32:34.6233771Z 2025-05-07T20:32:34.6233969Z if scale_ub is not None: 2025-05-07T20:32:34.6234246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.6234588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.6234901Z ) 2025-05-07T20:32:34.6235094Z else: 2025-05-07T20:32:34.6235310Z scale_ub_tensor = None 2025-05-07T20:32:34.6235566Z 2025-05-07T20:32:34.6235798Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.6236117Z op = silu_mul_quant 2025-05-07T20:32:34.6236368Z if compiled: 2025-05-07T20:32:34.6236618Z op = torch.compile(op) 2025-05-07T20:32:34.6236912Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6237187Z 2025-05-07T20:32:34.6237389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.6237554Z 2025-05-07T20:32:34.6237656Z moe/activation_test.py:117: 2025-05-07T20:32:34.6237961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6238391Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.6238676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.6239246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.6239856Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.6240523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.6241217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.6241761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.6242453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.6243120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.6243706Z kernel = self.compile( 2025-05-07T20:32:34.6244260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.6244929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.6245325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.6245562Z 2025-05-07T20:32:34.6245774Z self = 2025-05-07T20:32:34.6246873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.6248277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24c280>} 2025-05-07T20:32:34.6249698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.6250741Z context = 2025-05-07T20:32:34.6251040Z 2025-05-07T20:32:34.6251210Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.6251742Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.6252216Z module_map=module_map) 2025-05-07T20:32:34.6252585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.6252945Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.6253258Z E ^ 2025-05-07T20:32:34.6253725Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.6254191Z 2025-05-07T20:32:34.6254615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.6255132Z 2025-05-07T20:32:34.9664773Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.9665754Z self=, 2025-05-07T20:32:34.9666617Z T=16384, 2025-05-07T20:32:34.9667019Z D=5120, 2025-05-07T20:32:34.9667411Z scale_ub=1200.0, 2025-05-07T20:32:34.9667781Z contiguous=False, 2025-05-07T20:32:34.9668018Z compiled=False, 2025-05-07T20:32:34.9668232Z ) 2025-05-07T20:32:34.9668567Z self = 2025-05-07T20:32:34.9669102Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.9669398Z 2025-05-07T20:32:34.9669487Z @given( 2025-05-07T20:32:34.9669730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.9670066Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.9670649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.9670992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.9671334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.9671633Z ) 2025-05-07T20:32:34.9672078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.9672540Z def test_silu_mul_quant( 2025-05-07T20:32:34.9672799Z self, 2025-05-07T20:32:34.9673001Z T: int, 2025-05-07T20:32:34.9673214Z D: int, 2025-05-07T20:32:34.9673450Z scale_ub: Optional[float], 2025-05-07T20:32:34.9673730Z contiguous: bool, 2025-05-07T20:32:34.9673989Z compiled: bool, 2025-05-07T20:32:34.9674235Z ) -> None: 2025-05-07T20:32:34.9674466Z torch.manual_seed(2025) 2025-05-07T20:32:34.9674715Z 2025-05-07T20:32:34.9675009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.9675463Z 2025-05-07T20:32:34.9675665Z x_sign = torch.sign(x) 2025-05-07T20:32:34.9675973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.9676300Z x = x_sign * x_clamp 2025-05-07T20:32:34.9676544Z x0 = x[:, :D] 2025-05-07T20:32:34.9676781Z x1 = x[:, D:] 2025-05-07T20:32:34.9677008Z 2025-05-07T20:32:34.9677206Z if contiguous: 2025-05-07T20:32:34.9677450Z x0 = x0.contiguous() 2025-05-07T20:32:34.9677721Z x1 = x1.contiguous() 2025-05-07T20:32:34.9677969Z 2025-05-07T20:32:34.9678170Z if scale_ub is not None: 2025-05-07T20:32:34.9678454Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.9678795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.9679120Z ) 2025-05-07T20:32:34.9679330Z else: 2025-05-07T20:32:34.9679544Z scale_ub_tensor = None 2025-05-07T20:32:34.9679808Z 2025-05-07T20:32:34.9680066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.9680389Z op = silu_mul_quant 2025-05-07T20:32:34.9680652Z if compiled: 2025-05-07T20:32:34.9680914Z op = torch.compile(op) 2025-05-07T20:32:34.9681219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.9681508Z 2025-05-07T20:32:34.9681714Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.9681885Z 2025-05-07T20:32:34.9681991Z moe/activation_test.py:117: 2025-05-07T20:32:34.9682297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.9682643Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.9682944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.9683728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:34.9684447Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:34.9685006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:34.9685704Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:34.9686386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:34.9686946Z     kernel = self.compile(
2025-05-07T20:32:34.9687503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:34.9688213Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:34.9688630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:34.9688858Z 
2025-05-07T20:32:34.9689085Z self =
2025-05-07T20:32:34.9690265Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:34.9691685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24ed40>}
2025-05-07T20:32:34.9693095Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:34.9694143Z context =
2025-05-07T20:32:34.9694440Z 
2025-05-07T20:32:34.9694621Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:34.9695152Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:34.9695635Z                            module_map=module_map)
2025-05-07T20:32:34.9696056Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:34.9696436Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:34.9696700Z E       ^
2025-05-07T20:32:34.9697184Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:34.9697647Z 
2025-05-07T20:32:34.9698237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:34.9698759Z 
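This failure, and every further example below, has the same root cause: the kernel requests Triton's fp8e4nv (FP8 E4M3) dtype, which Triton only lowers on GPUs with native E4M3 hardware (compute capability 8.9 / Ada or newer), while the GPU running this job exposes only fp8e4b15 and fp8e5, exactly as the error reports. A minimal sketch of a capability guard that a test suite could use to skip such kernels on unsupported hardware follows; the helper name and the skip wiring are illustrative assumptions, not FBGEMM's actual test setup.

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv maps onto native E4M3 FP8, introduced with compute
    # capability 8.9 (Ada) and 9.0 (Hopper); earlier GPUs only offer
    # fp8e4b15 and fp8e5, matching the dtypes listed in the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Test(unittest.TestCase):
    def test_capability_assumption(self) -> None:
        # Placeholder body: the real property-based test would only run
        # here, on hardware where the fp8e4nv kernel can compile at all.
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

With a guard like this, Hypothesis would report a single skip instead of re-compiling and failing the same kernel once per drawn (T, D, scale_ub, contiguous, compiled) combination.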
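For orientation while reading the echoed test body above: silu_mul_quant fuses the gated activation y = silu(x0) * x1 with quantization to FP8, returning the quantized tensor together with its scale. The following eager-mode sketch shows the rough shape of that computation; the rowwise scaling scheme, the E4M3 output format, and the helper name are assumptions made for illustration, not FBGEMM's actual kernel contract.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Gated activation: y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    # Scale each row into the representable FP8 range, then cast.
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

On hardware limited to fp8e5 (E5M2), the final cast could target torch.float8_e5m2 instead, one of the two formats the error message lists as supported on this GPU.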
Hypothesis goes on to draw further examples, and each one fails with the identical echoed test body and traceback shown above (with an extra torch/_dynamo/eval_frame.py:678 frame in _fn when compiled=True), ending in the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100:

2025-05-07T20:32:34.9698876Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.1602372Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.1643792Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.2701373Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.6342462Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.6376155Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.7557251Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.8942594Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:35.8978168Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:36.0920997Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:36.0954645Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:36.2031127Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.2045900Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.2046066Z 
2025-05-07T20:32:36.2046172Z moe/activation_test.py:117: 
2025-05-07T20:32:36.2046507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.2046839Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.2047127Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.2047695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:36.2048294Z     return fn(*args, **kwargs)
2025-05-07T20:32:36.2048987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2049688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2050276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2050969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2051641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2052188Z kernel = self.compile( 2025-05-07T20:32:36.2052735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2053402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2053802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2054029Z 2025-05-07T20:32:36.2054243Z self = 2025-05-07T20:32:36.2055345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2057092Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac976560>} 2025-05-07T20:32:36.2058595Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2059638Z context = 2025-05-07T20:32:36.2059931Z 2025-05-07T20:32:36.2060099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2060723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2061204Z module_map=module_map) 2025-05-07T20:32:36.2061573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2061941Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2062204Z E ^ 2025-05-07T20:32:36.2062680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2063138Z 2025-05-07T20:32:36.2063560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2064083Z 2025-05-07T20:32:36.2899951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2900444Z self=, 2025-05-07T20:32:36.2900974Z T=16384, 2025-05-07T20:32:36.2901250Z D=5120, 2025-05-07T20:32:36.2901470Z scale_ub=None, 2025-05-07T20:32:36.2901699Z contiguous=False, 2025-05-07T20:32:36.2901932Z compiled=False, 2025-05-07T20:32:36.2902158Z ) 2025-05-07T20:32:36.2902491Z self = 2025-05-07T20:32:36.2903183Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.2903484Z 2025-05-07T20:32:36.2903566Z @given( 2025-05-07T20:32:36.2903812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2904215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2904525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2904866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2905206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2905493Z ) 2025-05-07T20:32:36.2905856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2906313Z def test_silu_mul_quant( 2025-05-07T20:32:36.2906556Z self, 2025-05-07T20:32:36.2906759Z T: int, 2025-05-07T20:32:36.2906960Z D: int, 2025-05-07T20:32:36.2907266Z scale_ub: Optional[float], 2025-05-07T20:32:36.2907545Z contiguous: bool, 2025-05-07T20:32:36.2907794Z compiled: bool, 2025-05-07T20:32:36.2908026Z ) -> None: 2025-05-07T20:32:36.2908246Z torch.manual_seed(2025) 2025-05-07T20:32:36.2908493Z 2025-05-07T20:32:36.2908776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2909122Z 2025-05-07T20:32:36.2909323Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2909620Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2911709Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
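The repeated CompilationError above is Triton refusing to lower the fp8e4nv type (Triton's name for torch.float8_e4m3fn) on this GPU: a g5.4xlarge runner carries an NVIDIA A10G at compute capability 8.6, while Triton only compiles fp8e4nv on compute capability 8.9 or newer, leaving exactly the 'fp8e4b15' and 'fp8e5' variants that the ValueError lists. A minimal capability guard along these lines could skip the test instead of failing it; supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite:

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) kernels.

    fp8e4nv requires compute capability >= 8.9 (Ada / Hopper); the A10G
    on this runner reports (8, 6), where only fp8e4b15 / fp8e5 compile.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test such as test_silu_mul_quant:
requires_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv needs SM 8.9+; skipping on this GPU"
)
```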
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2913650Z 2025-05-07T20:32:36.2913778Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2913996Z 2025-05-07T20:32:36.2914102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2914530Z self=, 2025-05-07T20:32:36.2914939Z T=4096, 2025-05-07T20:32:36.2915133Z D=7168, 2025-05-07T20:32:36.2915327Z scale_ub=1200.0, 2025-05-07T20:32:36.2915556Z contiguous=True, 2025-05-07T20:32:36.2915783Z compiled=True, 2025-05-07T20:32:36.2915985Z ) 2025-05-07T20:32:36.2916313Z self = 2025-05-07T20:32:36.2916892Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2917169Z 2025-05-07T20:32:36.2917249Z @given( 2025-05-07T20:32:36.2917490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2917811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2918118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2918459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2918795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2919091Z ) 2025-05-07T20:32:36.2919443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2919890Z def test_silu_mul_quant( 2025-05-07T20:32:36.2920136Z self, 2025-05-07T20:32:36.2920332Z T: int, 2025-05-07T20:32:36.2920534Z D: int, 2025-05-07T20:32:36.2920760Z scale_ub: Optional[float], 2025-05-07T20:32:36.2921030Z contiguous: bool, 2025-05-07T20:32:36.2921278Z compiled: bool, 2025-05-07T20:32:36.2921508Z ) -> None: 2025-05-07T20:32:36.2921724Z torch.manual_seed(2025) 2025-05-07T20:32:36.2921972Z 2025-05-07T20:32:36.2922250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2922641Z 2025-05-07T20:32:36.2922843Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2923140Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2925227Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
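The "Tried to allocate" figures in these OutOfMemoryErrors follow directly from the sampled shapes: a [T, 2 * D] bfloat16 tensor takes T * 2 * D * 2 bytes, and torch.abs, torch.clamp, and the x_sign * x_clamp product each materialize another tensor of that same size. Checking the failing examples against the log is plain arithmetic:

```python
def bf16_mib(T: int, D: int) -> float:
    """Size in MiB of one [T, 2 * D] bfloat16 tensor (2 bytes/element)."""
    return T * 2 * D * 2 / 2**20


print(bf16_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
print(bf16_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"
print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
```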
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2927186Z 2025-05-07T20:32:36.2927309Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2927564Z 2025-05-07T20:32:36.2927676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2928096Z self=, 2025-05-07T20:32:36.2928507Z T=16384, 2025-05-07T20:32:36.2928705Z D=7168, 2025-05-07T20:32:36.2928896Z scale_ub=None, 2025-05-07T20:32:36.2929120Z contiguous=False, 2025-05-07T20:32:36.2929354Z compiled=False, 2025-05-07T20:32:36.2929559Z ) 2025-05-07T20:32:36.2929887Z self = 2025-05-07T20:32:36.2930391Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.2930673Z 2025-05-07T20:32:36.2930755Z @given( 2025-05-07T20:32:36.2930984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2931309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2931619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2931947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2932283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2932590Z ) 2025-05-07T20:32:36.2932953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2933406Z def test_silu_mul_quant( 2025-05-07T20:32:36.2933647Z self, 2025-05-07T20:32:36.2933849Z T: int, 2025-05-07T20:32:36.2934052Z D: int, 2025-05-07T20:32:36.2934270Z scale_ub: Optional[float], 2025-05-07T20:32:36.2934548Z contiguous: bool, 2025-05-07T20:32:36.2934794Z compiled: bool, 2025-05-07T20:32:36.2935016Z ) -> None: 2025-05-07T20:32:36.2935240Z torch.manual_seed(2025) 2025-05-07T20:32:36.2935491Z 2025-05-07T20:32:36.2935764Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2937924Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2939948Z 2025-05-07T20:32:36.2940072Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.2940294Z 2025-05-07T20:32:36.2940399Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2940823Z self=, 2025-05-07T20:32:36.2941229Z T=2048, 2025-05-07T20:32:36.2941420Z D=7168, 2025-05-07T20:32:36.2941618Z scale_ub=1200.0, 2025-05-07T20:32:36.2941840Z contiguous=True, 2025-05-07T20:32:36.2942066Z compiled=True, 2025-05-07T20:32:36.2942272Z ) 2025-05-07T20:32:36.2942595Z self = 2025-05-07T20:32:36.2943143Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.2943417Z 2025-05-07T20:32:36.2943500Z @given( 2025-05-07T20:32:36.2943734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2944090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2944401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2944736Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2945066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2945358Z ) 2025-05-07T20:32:36.2945720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2946164Z def test_silu_mul_quant( 2025-05-07T20:32:36.2946418Z self, 2025-05-07T20:32:36.2946628Z T: int, 2025-05-07T20:32:36.2946822Z D: int, 2025-05-07T20:32:36.2947044Z scale_ub: Optional[float], 2025-05-07T20:32:36.2947368Z contiguous: bool, 2025-05-07T20:32:36.2947610Z compiled: bool, 2025-05-07T20:32:36.2947837Z ) -> None: 2025-05-07T20:32:36.2948077Z torch.manual_seed(2025) 2025-05-07T20:32:36.2948360Z 2025-05-07T20:32:36.2948632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2948986Z 2025-05-07T20:32:36.2949185Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2949476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2951527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2953429Z 2025-05-07T20:32:36.2953550Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:36.2953766Z 2025-05-07T20:32:36.2953877Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2954306Z self=, 2025-05-07T20:32:36.2954709Z T=2048, 2025-05-07T20:32:36.2954904Z D=7168, 2025-05-07T20:32:36.2955104Z scale_ub=None, 2025-05-07T20:32:36.2955320Z contiguous=True, 2025-05-07T20:32:36.2955841Z compiled=False, 2025-05-07T20:32:36.2956124Z ) 2025-05-07T20:32:36.5960352Z self = 2025-05-07T20:32:36.5961353Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5961647Z 2025-05-07T20:32:36.5961731Z @given( 2025-05-07T20:32:36.5961973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5962298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5962625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5962974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5963318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5963613Z ) 2025-05-07T20:32:36.5963976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5964429Z def test_silu_mul_quant( 2025-05-07T20:32:36.5964678Z self, 2025-05-07T20:32:36.5964883Z T: int, 2025-05-07T20:32:36.5965086Z D: int, 2025-05-07T20:32:36.5965311Z scale_ub: Optional[float], 2025-05-07T20:32:36.5965593Z contiguous: bool, 2025-05-07T20:32:36.5966047Z compiled: bool, 2025-05-07T20:32:36.5966280Z ) -> None: 2025-05-07T20:32:36.5966507Z torch.manual_seed(2025) 2025-05-07T20:32:36.5966757Z 2025-05-07T20:32:36.5967041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5967400Z 2025-05-07T20:32:36.5967684Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.5969713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
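Each of these examples dies on its very first allocations (torch.randn, torch.sign, torch.clamp) because roughly 21.9 to 22.0 GiB of the device's 22.07 GiB is already in use, so the failures are cumulative rather than per-example: memory held by earlier examples and their torch.compile artifacts is never returned to the pool between Hypothesis draws. One plausible mitigation, assuming nothing about the suite's own fixtures, is a best-effort cleanup between examples:

```python
import gc

import torch


def release_cuda_memory() -> None:
    """Best-effort cleanup between property-based test examples.

    Drops dead Python references first, then has the caching allocator
    return unused blocks so the next example starts against a mostly
    empty pool.
    """
    gc.collect()              # free tensors kept alive only by cycles
    torch.cuda.synchronize()  # make sure pending kernels have finished
    torch.cuda.empty_cache()  # release cached blocks back to the driver
```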
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5971725Z 2025-05-07T20:32:36.5971859Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.5972079Z 2025-05-07T20:32:36.5972187Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5972616Z self=, 2025-05-07T20:32:36.5973140Z T=1, 2025-05-07T20:32:36.5973333Z D=7168, 2025-05-07T20:32:36.5973536Z scale_ub=1200.0, 2025-05-07T20:32:36.5973768Z contiguous=True, 2025-05-07T20:32:36.5973995Z compiled=False, 2025-05-07T20:32:36.5974213Z ) 2025-05-07T20:32:36.5974547Z self = 2025-05-07T20:32:36.5975048Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5975317Z 2025-05-07T20:32:36.5975398Z @given( 2025-05-07T20:32:36.5975636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5975961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5976273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5976618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5976964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5977258Z ) 2025-05-07T20:32:36.5977623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5978175Z def test_silu_mul_quant( 2025-05-07T20:32:36.5978433Z self, 2025-05-07T20:32:36.5978663Z T: int, 2025-05-07T20:32:36.5978888Z D: int, 2025-05-07T20:32:36.5979118Z scale_ub: Optional[float], 2025-05-07T20:32:36.5979399Z contiguous: bool, 2025-05-07T20:32:36.5979650Z compiled: bool, 2025-05-07T20:32:36.5979886Z ) -> None: 2025-05-07T20:32:36.5980108Z torch.manual_seed(2025) 2025-05-07T20:32:36.5980359Z 2025-05-07T20:32:36.5980645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5980993Z 2025-05-07T20:32:36.5981197Z x_sign = torch.sign(x) 2025-05-07T20:32:36.5981547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.5981861Z x = x_sign * x_clamp 2025-05-07T20:32:36.5982113Z x0 = x[:, :D] 2025-05-07T20:32:36.5982341Z x1 = x[:, D:] 2025-05-07T20:32:36.5982550Z 2025-05-07T20:32:36.5982750Z if contiguous: 2025-05-07T20:32:36.5982995Z x0 = x0.contiguous() 2025-05-07T20:32:36.5983259Z x1 = x1.contiguous() 2025-05-07T20:32:36.5983508Z 2025-05-07T20:32:36.5983713Z if scale_ub is not None: 2025-05-07T20:32:36.5984002Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.5984349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.5984682Z ) 2025-05-07T20:32:36.5984886Z else: 2025-05-07T20:32:36.5985102Z scale_ub_tensor = None 2025-05-07T20:32:36.5985365Z 2025-05-07T20:32:36.5985610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.5985929Z op = silu_mul_quant 2025-05-07T20:32:36.5986194Z if compiled: 2025-05-07T20:32:36.5986450Z op = torch.compile(op) 2025-05-07T20:32:36.5986751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.5987039Z 2025-05-07T20:32:36.5987249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.5987419Z 2025-05-07T20:32:36.5987571Z moe/activation_test.py:117: 2025-05-07T20:32:36.5987877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.5988219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.5988552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.5989261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.5989973Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.5990529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.5991231Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.5991914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.5992508Z kernel = self.compile( 2025-05-07T20:32:36.5993072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.5993743Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.5994155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.5994383Z 2025-05-07T20:32:36.5994607Z self = 2025-05-07T20:32:36.5995715Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.5997123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac6884c0>} 2025-05-07T20:32:36.5998510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.5999565Z context = 2025-05-07T20:32:36.5999864Z 2025-05-07T20:32:36.6000042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6000575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6001062Z module_map=module_map) 2025-05-07T20:32:36.6001437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6001802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6002114Z E ^ 2025-05-07T20:32:36.6002595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6003055Z 2025-05-07T20:32:36.6003488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6004009Z 2025-05-07T20:32:36.6004125Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6004548Z self=, 2025-05-07T20:32:36.6004967Z T=128, 2025-05-07T20:32:36.6005168Z D=5120, 2025-05-07T20:32:36.6005367Z scale_ub=None, 2025-05-07T20:32:36.6005595Z contiguous=True, 2025-05-07T20:32:36.6005832Z compiled=False, 2025-05-07T20:32:36.6006043Z ) 2025-05-07T20:32:36.6783690Z self = 2025-05-07T20:32:36.6784500Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.6784894Z 2025-05-07T20:32:36.6785007Z @given( 2025-05-07T20:32:36.6785346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.6785688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.6786246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.6786600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.6786944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.6787238Z ) 2025-05-07T20:32:36.6787677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.6788133Z def test_silu_mul_quant( 2025-05-07T20:32:36.6788388Z self, 2025-05-07T20:32:36.6788598Z T: int, 2025-05-07T20:32:36.6788809Z D: int, 2025-05-07T20:32:36.6789036Z scale_ub: Optional[float], 2025-05-07T20:32:36.6789323Z contiguous: bool, 2025-05-07T20:32:36.6789592Z compiled: bool, 2025-05-07T20:32:36.6789832Z ) -> None: 2025-05-07T20:32:36.6790063Z torch.manual_seed(2025) 2025-05-07T20:32:36.6790319Z 2025-05-07T20:32:36.6790604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.6799346Z 2025-05-07T20:32:36.6799605Z x_sign = torch.sign(x) 2025-05-07T20:32:36.6799927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.6800253Z x = x_sign * x_clamp 2025-05-07T20:32:36.6800503Z x0 = x[:, :D] 2025-05-07T20:32:36.6800719Z x1 = x[:, D:] 2025-05-07T20:32:36.6800947Z 2025-05-07T20:32:36.6801147Z if contiguous: 2025-05-07T20:32:36.6801383Z x0 = x0.contiguous() 2025-05-07T20:32:36.6801652Z x1 = x1.contiguous() 2025-05-07T20:32:36.6801901Z 2025-05-07T20:32:36.6802104Z if scale_ub is not None: 2025-05-07T20:32:36.6802385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.6802741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.6803074Z ) 2025-05-07T20:32:36.6803275Z else: 2025-05-07T20:32:36.6803502Z scale_ub_tensor = None 2025-05-07T20:32:36.6803768Z 2025-05-07T20:32:36.6804009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.6804345Z op = silu_mul_quant 2025-05-07T20:32:36.6804611Z if compiled: 2025-05-07T20:32:36.6804865Z op = torch.compile(op) 2025-05-07T20:32:36.6805177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6805463Z 2025-05-07T20:32:36.6805660Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.6805838Z 2025-05-07T20:32:36.6805944Z moe/activation_test.py:117: 2025-05-07T20:32:36.6806257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6806602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.6806894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6807733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.6808450Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.6809000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.6809702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.6810386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.6810938Z kernel = self.compile( 2025-05-07T20:32:36.6811491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.6812169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6812581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6812811Z 2025-05-07T20:32:36.6813037Z self = 2025-05-07T20:32:36.6814193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.6815622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac688940>} 2025-05-07T20:32:36.6817046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.6818216Z context = 2025-05-07T20:32:36.6818516Z 2025-05-07T20:32:36.6818689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6819232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6819720Z module_map=module_map) 2025-05-07T20:32:36.6820149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6820516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6820790Z E ^ 2025-05-07T20:32:36.6821272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6821735Z 2025-05-07T20:32:36.6822161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6822692Z 2025-05-07T20:32:36.6822800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6823232Z self=, 2025-05-07T20:32:36.6823642Z T=128, 2025-05-07T20:32:36.6823832Z D=7168, 2025-05-07T20:32:36.6824037Z scale_ub=None, 2025-05-07T20:32:36.6824257Z contiguous=True, 2025-05-07T20:32:36.6824486Z compiled=False, 2025-05-07T20:32:36.6824705Z ) 2025-05-07T20:32:36.6825031Z self = 2025-05-07T20:32:36.6825537Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.6825807Z 2025-05-07T20:32:36.6825890Z @given( 2025-05-07T20:32:36.6826127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.6826451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.6826768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.6827106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.6827439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.6827734Z ) 2025-05-07T20:32:36.6828096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.6828588Z def test_silu_mul_quant( 2025-05-07T20:32:36.6828879Z self, 2025-05-07T20:32:36.6829087Z T: int, 2025-05-07T20:32:36.6829281Z D: int, 2025-05-07T20:32:36.6829510Z scale_ub: Optional[float], 2025-05-07T20:32:36.6829789Z contiguous: bool, 2025-05-07T20:32:36.6830035Z compiled: bool, 2025-05-07T20:32:36.6830271Z ) -> None: 2025-05-07T20:32:36.6830494Z torch.manual_seed(2025) 2025-05-07T20:32:36.6830739Z 2025-05-07T20:32:36.6831029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.6831384Z 2025-05-07T20:32:36.6831586Z x_sign = torch.sign(x) 2025-05-07T20:32:36.6831883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.6832199Z x = x_sign * x_clamp 2025-05-07T20:32:36.6832450Z x0 = x[:, :D] 2025-05-07T20:32:36.6832672Z x1 = x[:, D:] 2025-05-07T20:32:36.6832886Z 2025-05-07T20:32:36.6833084Z if contiguous: 2025-05-07T20:32:36.6833321Z x0 = x0.contiguous() 2025-05-07T20:32:36.6833591Z x1 = x1.contiguous() 2025-05-07T20:32:36.6833841Z 2025-05-07T20:32:36.6834035Z if scale_ub is not None: 2025-05-07T20:32:36.6834318Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.6834712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.6835015Z ) 2025-05-07T20:32:36.6835210Z else: 2025-05-07T20:32:36.6835425Z scale_ub_tensor = None 2025-05-07T20:32:36.6835675Z 2025-05-07T20:32:36.6835955Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.6836274Z op = silu_mul_quant 2025-05-07T20:32:36.6836530Z if compiled: 2025-05-07T20:32:36.6836778Z op = torch.compile(op) 2025-05-07T20:32:36.6837085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6837368Z 2025-05-07T20:32:36.6837568Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.6837749Z 2025-05-07T20:32:36.6837855Z moe/activation_test.py:117: 2025-05-07T20:32:36.6838161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6838491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.6838850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.6839585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.6840293Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.6840835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.6841536Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.6842217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.6842759Z kernel = self.compile( 2025-05-07T20:32:36.6843317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.6843989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6844395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.6844623Z 2025-05-07T20:32:36.6844840Z self = 2025-05-07T20:32:36.6845944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.6847346Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac689240>} 2025-05-07T20:32:36.6848764Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.6849816Z context = 2025-05-07T20:32:36.6850115Z 2025-05-07T20:32:36.6850288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.6850823Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6851305Z module_map=module_map) 2025-05-07T20:32:36.6851677Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6852039Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6852308Z E ^ 2025-05-07T20:32:36.6852778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6853242Z 2025-05-07T20:32:36.6853671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.6854198Z 2025-05-07T20:32:36.6854305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.6854733Z self=, 2025-05-07T20:32:36.6855162Z T=2048, 2025-05-07T20:32:36.6855396Z D=7168, 2025-05-07T20:32:36.6855927Z scale_ub=1200.0, 2025-05-07T20:32:36.6856162Z contiguous=True, 2025-05-07T20:32:36.6856384Z compiled=False, 2025-05-07T20:32:36.6856594Z ) 2025-05-07T20:32:36.7806307Z self = 2025-05-07T20:32:36.7807136Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.7807534Z 2025-05-07T20:32:36.7807650Z @given( 2025-05-07T20:32:36.7807905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7808234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7808600Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7808964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7809300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7809757Z ) 2025-05-07T20:32:36.7810126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7810592Z def test_silu_mul_quant( 2025-05-07T20:32:36.7810840Z self, 2025-05-07T20:32:36.7811048Z T: int, 2025-05-07T20:32:36.7811257Z D: int, 2025-05-07T20:32:36.7811483Z scale_ub: Optional[float], 2025-05-07T20:32:36.7811779Z contiguous: bool, 2025-05-07T20:32:36.7812032Z compiled: bool, 2025-05-07T20:32:36.7812265Z ) -> None: 2025-05-07T20:32:36.7812499Z torch.manual_seed(2025) 2025-05-07T20:32:36.7812757Z 2025-05-07T20:32:36.7813040Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7815186Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7817118Z 2025-05-07T20:32:36.7817250Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7817477Z 2025-05-07T20:32:36.7817587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7818111Z self=, 2025-05-07T20:32:36.7818521Z T=1, 2025-05-07T20:32:36.7818721Z D=5120, 2025-05-07T20:32:36.7818922Z scale_ub=1200.0, 2025-05-07T20:32:36.7819150Z contiguous=True, 2025-05-07T20:32:36.7819380Z compiled=False, 2025-05-07T20:32:36.7819694Z ) 2025-05-07T20:32:36.7820022Z self = 2025-05-07T20:32:36.7820526Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.7820799Z 2025-05-07T20:32:36.7820886Z @given( 2025-05-07T20:32:36.7821129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7821447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7821764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7822113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7822450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7822763Z ) 2025-05-07T20:32:36.7823131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7823587Z def test_silu_mul_quant( 2025-05-07T20:32:36.7823841Z self, 2025-05-07T20:32:36.7824046Z T: int, 2025-05-07T20:32:36.7824252Z D: int, 2025-05-07T20:32:36.7824482Z scale_ub: Optional[float], 2025-05-07T20:32:36.7824769Z contiguous: bool, 2025-05-07T20:32:36.7825016Z compiled: bool, 2025-05-07T20:32:36.7825252Z ) -> None: 2025-05-07T20:32:36.7825484Z torch.manual_seed(2025) 2025-05-07T20:32:36.7825809Z 2025-05-07T20:32:36.7826100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7826455Z 2025-05-07T20:32:36.7826657Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7827063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7827384Z x = x_sign * x_clamp 2025-05-07T20:32:36.7827634Z x0 = x[:, :D] 2025-05-07T20:32:36.7827863Z x1 = x[:, D:] 2025-05-07T20:32:36.7828084Z 2025-05-07T20:32:36.7828300Z if contiguous: 2025-05-07T20:32:36.7828573Z x0 = x0.contiguous() 2025-05-07T20:32:36.7828845Z x1 = x1.contiguous() 2025-05-07T20:32:36.7829097Z 2025-05-07T20:32:36.7829298Z if scale_ub is not None: 2025-05-07T20:32:36.7829585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7829936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7830301Z ) 2025-05-07T20:32:36.7830504Z else: 2025-05-07T20:32:36.7830728Z scale_ub_tensor = None 2025-05-07T20:32:36.7830985Z 2025-05-07T20:32:36.7831231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7831562Z op = silu_mul_quant 2025-05-07T20:32:36.7831819Z if compiled: 2025-05-07T20:32:36.7832081Z op = torch.compile(op) 2025-05-07T20:32:36.7832393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7832673Z 2025-05-07T20:32:36.7832877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7833055Z 2025-05-07T20:32:36.7833161Z moe/activation_test.py:117: 2025-05-07T20:32:36.7833467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7833808Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7834105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7834821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7835532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7836090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7836790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7837469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7838015Z kernel = self.compile( 2025-05-07T20:32:36.7838575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7839351Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7839756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7839994Z 2025-05-07T20:32:36.7840209Z self = 2025-05-07T20:32:36.7841318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7842729Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac68a200>} 2025-05-07T20:32:36.7844106Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7845157Z context = 2025-05-07T20:32:36.7845459Z 2025-05-07T20:32:36.7845630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7846219Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7846712Z module_map=module_map) 2025-05-07T20:32:36.7847083Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7847452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7847764Z E ^ 2025-05-07T20:32:36.7848232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7848698Z 2025-05-07T20:32:36.7849123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7849652Z 2025-05-07T20:32:36.7849762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7850193Z self=, 2025-05-07T20:32:36.7850600Z T=2048, 2025-05-07T20:32:36.7850841Z D=5120, 2025-05-07T20:32:36.7851043Z scale_ub=None, 2025-05-07T20:32:36.7851261Z contiguous=True, 2025-05-07T20:32:36.7851500Z compiled=False, 2025-05-07T20:32:36.7851713Z ) 2025-05-07T20:32:36.7852037Z self = 2025-05-07T20:32:36.7852542Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.7852826Z 2025-05-07T20:32:36.7852908Z @given( 2025-05-07T20:32:36.7853151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7853470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7853785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7854123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7854459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7854751Z ) 2025-05-07T20:32:36.7855111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7855847Z def test_silu_mul_quant( 2025-05-07T20:32:36.7856102Z self, 2025-05-07T20:32:36.7856305Z T: int, 2025-05-07T20:32:36.7856503Z D: int, 2025-05-07T20:32:36.7856729Z scale_ub: Optional[float], 2025-05-07T20:32:36.7857008Z contiguous: bool, 2025-05-07T20:32:36.7857251Z compiled: bool, 2025-05-07T20:32:36.7857485Z ) -> None: 2025-05-07T20:32:36.7857709Z torch.manual_seed(2025) 2025-05-07T20:32:36.7857959Z 2025-05-07T20:32:36.7858314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7858668Z 2025-05-07T20:32:36.7858900Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.7860989Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
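Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis drawing a fresh combination from the sampled_from strategies and re-running the test; the attempts are printed only because the test opts into Verbosity.verbose in its @settings. Once a failure like the ones above is known, one way to pin a specific draw from this log as a permanent regression case is @example, sketched here on a stripped-down test (the body is illustrative, not the FBGEMM test):

```python
from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=128, D=7168)  # the first failing draw reported above
@settings(deadline=None)
def test_shapes_are_positive(T: int, D: int) -> None:
    # Stand-in property; the real test builds [T, 2 * D] tensors.
    assert T * 2 * D > 0
```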
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7862912Z 2025-05-07T20:32:36.7863037Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.7863263Z 2025-05-07T20:32:36.7863370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7863797Z self=, 2025-05-07T20:32:36.7864209Z T=16384, 2025-05-07T20:32:36.7864405Z D=5120, 2025-05-07T20:32:36.7864605Z scale_ub=None, 2025-05-07T20:32:36.7864827Z contiguous=True, 2025-05-07T20:32:36.7865053Z compiled=False, 2025-05-07T20:32:36.7865266Z ) 2025-05-07T20:32:36.8837863Z self = 2025-05-07T20:32:36.8838635Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8839036Z 2025-05-07T20:32:36.8839146Z @given( 2025-05-07T20:32:36.8839676Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8840136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8840448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8840872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8841209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8841500Z ) 2025-05-07T20:32:36.8841852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8842301Z def test_silu_mul_quant( 2025-05-07T20:32:36.8842551Z self, 2025-05-07T20:32:36.8842746Z T: int, 2025-05-07T20:32:36.8842950Z D: int, 2025-05-07T20:32:36.8843178Z scale_ub: Optional[float], 2025-05-07T20:32:36.8843453Z contiguous: bool, 2025-05-07T20:32:36.8843700Z compiled: bool, 2025-05-07T20:32:36.8844012Z ) -> None: 2025-05-07T20:32:36.8844230Z torch.manual_seed(2025) 2025-05-07T20:32:36.8844480Z 2025-05-07T20:32:36.8844766Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8846886Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
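The allocator hint appended to every OutOfMemoryError deserves a caveat here: with only about 19 to 141 MiB "reserved by PyTorch but unallocated", these failures look like genuine exhaustion rather than fragmentation, so expandable segments alone may not rescue the job. If it is tried anyway, PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the first tensor reaches the GPU, e.g. in the workflow's env block or at interpreter startup:

```python
import os

# Must be in place before CUDA is initialized; setting it after the
# first allocation has no effect on the already-created allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (deliberately imported after the env var)

x = torch.zeros(1, device="cuda")  # first allocation sees the setting
```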
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8848810Z 2025-05-07T20:32:36.8848942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8849161Z 2025-05-07T20:32:36.8849267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8849699Z self=, 2025-05-07T20:32:36.8850115Z T=4096, 2025-05-07T20:32:36.8850302Z D=5120, 2025-05-07T20:32:36.8850504Z scale_ub=None, 2025-05-07T20:32:36.8850727Z contiguous=True, 2025-05-07T20:32:36.8850951Z compiled=False, 2025-05-07T20:32:36.8851167Z ) 2025-05-07T20:32:36.8851497Z self = 2025-05-07T20:32:36.8851994Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8852276Z 2025-05-07T20:32:36.8852354Z @given( 2025-05-07T20:32:36.8852592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8852911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8853295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8853639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8853977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8854271Z ) 2025-05-07T20:32:36.8854635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8855091Z def test_silu_mul_quant( 2025-05-07T20:32:36.8855338Z self, 2025-05-07T20:32:36.8855907Z T: int, 2025-05-07T20:32:36.8856123Z D: int, 2025-05-07T20:32:36.8856346Z scale_ub: Optional[float], 2025-05-07T20:32:36.8856626Z contiguous: bool, 2025-05-07T20:32:36.8856873Z compiled: bool, 2025-05-07T20:32:36.8857105Z ) -> None: 2025-05-07T20:32:36.8857324Z torch.manual_seed(2025) 2025-05-07T20:32:36.8857576Z 2025-05-07T20:32:36.8857856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8860202Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8862169Z 2025-05-07T20:32:36.8862294Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8862515Z 2025-05-07T20:32:36.8862621Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8863047Z self=, 2025-05-07T20:32:36.8863455Z T=2048, 2025-05-07T20:32:36.8863641Z D=5120, 2025-05-07T20:32:36.8863839Z scale_ub=None, 2025-05-07T20:32:36.8864060Z contiguous=False, 2025-05-07T20:32:36.8864291Z compiled=False, 2025-05-07T20:32:36.8864501Z ) 2025-05-07T20:32:36.8864825Z self = 2025-05-07T20:32:36.8865391Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.8865673Z 2025-05-07T20:32:36.8865757Z @given( 2025-05-07T20:32:36.8866010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8866331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8866651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8866987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8867323Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8867610Z ) 2025-05-07T20:32:36.8867967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8868416Z def test_silu_mul_quant( 2025-05-07T20:32:36.8868657Z self, 2025-05-07T20:32:36.8868863Z T: int, 2025-05-07T20:32:36.8869064Z D: int, 2025-05-07T20:32:36.8869281Z scale_ub: Optional[float], 2025-05-07T20:32:36.8869559Z contiguous: bool, 2025-05-07T20:32:36.8869808Z compiled: bool, 2025-05-07T20:32:36.8870032Z ) -> None: 2025-05-07T20:32:36.8870258Z torch.manual_seed(2025) 2025-05-07T20:32:36.8870506Z 2025-05-07T20:32:36.8878477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8880831Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8882746Z 2025-05-07T20:32:36.8882871Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8883097Z 2025-05-07T20:32:36.8883203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8883629Z self=, 2025-05-07T20:32:36.8884027Z T=4096, 2025-05-07T20:32:36.8884218Z D=7168, 2025-05-07T20:32:36.8884413Z scale_ub=None, 2025-05-07T20:32:36.8884632Z contiguous=True, 2025-05-07T20:32:36.8884854Z compiled=True, 2025-05-07T20:32:36.8885059Z ) 2025-05-07T20:32:36.8885383Z self = 2025-05-07T20:32:36.8885872Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.8886146Z 2025-05-07T20:32:36.8886229Z @given( 2025-05-07T20:32:36.8886465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8886782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8887100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8887438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8887768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8888059Z ) 2025-05-07T20:32:36.8888497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8888976Z def test_silu_mul_quant( 2025-05-07T20:32:36.8889216Z self, 2025-05-07T20:32:36.8889463Z T: int, 2025-05-07T20:32:36.8889666Z D: int, 2025-05-07T20:32:36.8889884Z scale_ub: Optional[float], 2025-05-07T20:32:36.8890163Z contiguous: bool, 2025-05-07T20:32:36.8890411Z compiled: bool, 2025-05-07T20:32:36.8890634Z ) -> None: 2025-05-07T20:32:36.8890856Z torch.manual_seed(2025) 2025-05-07T20:32:36.8891109Z 2025-05-07T20:32:36.8891419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8893510Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8895464Z 2025-05-07T20:32:36.8895583Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8895804Z 2025-05-07T20:32:36.8895910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8896333Z self=, 2025-05-07T20:32:36.8896735Z T=2048, 2025-05-07T20:32:36.8896927Z D=5120, 2025-05-07T20:32:36.8897129Z scale_ub=1200.0, 2025-05-07T20:32:36.8897364Z contiguous=False, 2025-05-07T20:32:36.8897590Z compiled=False, 2025-05-07T20:32:36.8897805Z ) 2025-05-07T20:32:36.8898226Z self = 2025-05-07T20:32:36.8898729Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.8899013Z 2025-05-07T20:32:36.8899092Z @given( 2025-05-07T20:32:36.8899330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8899646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8899959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8900296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8900627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8900923Z ) 2025-05-07T20:32:36.8901283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8901730Z def test_silu_mul_quant( 2025-05-07T20:32:36.8902016Z self, 2025-05-07T20:32:36.8902216Z T: int, 2025-05-07T20:32:36.8902419Z D: int, 2025-05-07T20:32:36.8902636Z scale_ub: Optional[float], 2025-05-07T20:32:36.8902916Z contiguous: bool, 2025-05-07T20:32:36.8903154Z compiled: bool, 2025-05-07T20:32:36.8903381Z ) -> None: 2025-05-07T20:32:36.8903603Z torch.manual_seed(2025) 2025-05-07T20:32:36.8903843Z 2025-05-07T20:32:36.8904123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8906209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.8908102Z 2025-05-07T20:32:36.8908241Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.8908493Z 2025-05-07T20:32:36.8908650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8909068Z self=, 2025-05-07T20:32:36.8909475Z T=4096, 2025-05-07T20:32:36.8909671Z D=7168, 2025-05-07T20:32:36.8909899Z scale_ub=1200.0, 2025-05-07T20:32:36.8910126Z contiguous=True, 2025-05-07T20:32:36.8910354Z compiled=False, 2025-05-07T20:32:36.8910560Z ) 2025-05-07T20:32:37.0183186Z self = 2025-05-07T20:32:37.0183970Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0184372Z 2025-05-07T20:32:37.0184484Z @given( 2025-05-07T20:32:37.0184834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0185252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0185571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0186188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0186534Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0186822Z ) 2025-05-07T20:32:37.0187185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0187639Z def test_silu_mul_quant( 2025-05-07T20:32:37.0187888Z self, 2025-05-07T20:32:37.0188094Z T: int, 2025-05-07T20:32:37.0188301Z D: int, 2025-05-07T20:32:37.0188526Z scale_ub: Optional[float], 2025-05-07T20:32:37.0188807Z contiguous: bool, 2025-05-07T20:32:37.0189059Z compiled: bool, 2025-05-07T20:32:37.0189291Z ) -> None: 2025-05-07T20:32:37.0189519Z torch.manual_seed(2025) 2025-05-07T20:32:37.0189772Z 2025-05-07T20:32:37.0190053Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0192180Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0194119Z 2025-05-07T20:32:37.0194242Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0194464Z 2025-05-07T20:32:37.0194572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0195001Z self=, 2025-05-07T20:32:37.0195491Z T=16384, 2025-05-07T20:32:37.0195696Z D=7168, 2025-05-07T20:32:37.0195896Z scale_ub=None, 2025-05-07T20:32:37.0196116Z contiguous=False, 2025-05-07T20:32:37.0196351Z compiled=True, 2025-05-07T20:32:37.0196569Z ) 2025-05-07T20:32:37.0196895Z self = 2025-05-07T20:32:37.0197402Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.0197689Z 2025-05-07T20:32:37.0197769Z @given( 2025-05-07T20:32:37.0198011Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0198353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0198690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0199027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0199359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0199650Z ) 2025-05-07T20:32:37.0200013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0200459Z def test_silu_mul_quant( 2025-05-07T20:32:37.0200709Z self, 2025-05-07T20:32:37.0200912Z T: int, 2025-05-07T20:32:37.0201121Z D: int, 2025-05-07T20:32:37.0201341Z scale_ub: Optional[float], 2025-05-07T20:32:37.0201743Z contiguous: bool, 2025-05-07T20:32:37.0201998Z compiled: bool, 2025-05-07T20:32:37.0202224Z ) -> None: 2025-05-07T20:32:37.0202450Z torch.manual_seed(2025) 2025-05-07T20:32:37.0202705Z 2025-05-07T20:32:37.0203059Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0205159Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0207117Z 2025-05-07T20:32:37.0207243Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0207466Z 2025-05-07T20:32:37.0207574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0207999Z self=, 2025-05-07T20:32:37.0208407Z T=4096, 2025-05-07T20:32:37.0208605Z D=7168, 2025-05-07T20:32:37.0208811Z scale_ub=None, 2025-05-07T20:32:37.0209056Z contiguous=True, 2025-05-07T20:32:37.0209313Z compiled=False, 2025-05-07T20:32:37.0209527Z ) 2025-05-07T20:32:37.0209853Z self = 2025-05-07T20:32:37.0210359Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0210636Z 2025-05-07T20:32:37.0210727Z @given( 2025-05-07T20:32:37.0210961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0211283Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0211599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0211938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0212272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0212566Z ) 2025-05-07T20:32:37.0212923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0213379Z def test_silu_mul_quant( 2025-05-07T20:32:37.0213627Z self, 2025-05-07T20:32:37.0213828Z T: int, 2025-05-07T20:32:37.0214035Z D: int, 2025-05-07T20:32:37.0214259Z scale_ub: Optional[float], 2025-05-07T20:32:37.0214543Z contiguous: bool, 2025-05-07T20:32:37.0214785Z compiled: bool, 2025-05-07T20:32:37.0215015Z ) -> None: 2025-05-07T20:32:37.0215291Z torch.manual_seed(2025) 2025-05-07T20:32:37.0215541Z 2025-05-07T20:32:37.0215819Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0217914Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0219990Z 2025-05-07T20:32:37.0220112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0220327Z 2025-05-07T20:32:37.0220441Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0220864Z self=, 2025-05-07T20:32:37.0221276Z T=16384, 2025-05-07T20:32:37.0221475Z D=7168, 2025-05-07T20:32:37.0221670Z scale_ub=None, 2025-05-07T20:32:37.0221893Z contiguous=True, 2025-05-07T20:32:37.0222128Z compiled=False, 2025-05-07T20:32:37.0222337Z ) 2025-05-07T20:32:37.0222715Z self = 2025-05-07T20:32:37.0223223Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0223572Z 2025-05-07T20:32:37.0223663Z @given( 2025-05-07T20:32:37.0223897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0224219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0224534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0224865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0225202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0225497Z ) 2025-05-07T20:32:37.0225853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0226304Z def test_silu_mul_quant( 2025-05-07T20:32:37.0226601Z self, 2025-05-07T20:32:37.0226802Z T: int, 2025-05-07T20:32:37.0227000Z D: int, 2025-05-07T20:32:37.0227228Z scale_ub: Optional[float], 2025-05-07T20:32:37.0227507Z contiguous: bool, 2025-05-07T20:32:37.0227750Z compiled: bool, 2025-05-07T20:32:37.0227979Z ) -> None: 2025-05-07T20:32:37.0228211Z torch.manual_seed(2025) 2025-05-07T20:32:37.0228455Z 2025-05-07T20:32:37.0228731Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0230841Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0232750Z 2025-05-07T20:32:37.0232871Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0233088Z 2025-05-07T20:32:37.0233205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0233620Z self=, 2025-05-07T20:32:37.0234029Z T=16384, 2025-05-07T20:32:37.0234228Z D=7168, 2025-05-07T20:32:37.0234445Z scale_ub=1200.0, 2025-05-07T20:32:37.0234668Z contiguous=True, 2025-05-07T20:32:37.0234897Z compiled=False, 2025-05-07T20:32:37.0235108Z ) 2025-05-07T20:32:37.0235427Z self = 2025-05-07T20:32:37.0235984Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0236266Z 2025-05-07T20:32:37.0236352Z @given( 2025-05-07T20:32:37.0236587Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0236905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0237219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0237754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0238086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0238381Z ) 2025-05-07T20:32:37.0238734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0239173Z def test_silu_mul_quant( 2025-05-07T20:32:37.0239417Z self, 2025-05-07T20:32:37.0239615Z T: int, 2025-05-07T20:32:37.0239809Z D: int, 2025-05-07T20:32:37.0240033Z scale_ub: Optional[float], 2025-05-07T20:32:37.0240307Z contiguous: bool, 2025-05-07T20:32:37.0240546Z compiled: bool, 2025-05-07T20:32:37.0240780Z ) -> None: 2025-05-07T20:32:37.0241007Z torch.manual_seed(2025) 2025-05-07T20:32:37.0241253Z 2025-05-07T20:32:37.0241526Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0243666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0245604Z 2025-05-07T20:32:37.0245724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0245937Z 2025-05-07T20:32:37.0246055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0246473Z self=, 2025-05-07T20:32:37.0246879Z T=128, 2025-05-07T20:32:37.0247115Z D=5120, 2025-05-07T20:32:37.0247313Z scale_ub=1200.0, 2025-05-07T20:32:37.0247540Z contiguous=False, 2025-05-07T20:32:37.0247772Z compiled=False, 2025-05-07T20:32:37.0247983Z ) 2025-05-07T20:32:37.1666287Z self = 2025-05-07T20:32:37.1666927Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1667231Z 2025-05-07T20:32:37.1667317Z @given( 2025-05-07T20:32:37.1667551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1667871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1668187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1668567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1668912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1669202Z ) 2025-05-07T20:32:37.1669563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1670016Z def test_silu_mul_quant( 2025-05-07T20:32:37.1670263Z self, 2025-05-07T20:32:37.1670487Z T: int, 2025-05-07T20:32:37.1670688Z D: int, 2025-05-07T20:32:37.1670907Z scale_ub: Optional[float], 2025-05-07T20:32:37.1671185Z contiguous: bool, 2025-05-07T20:32:37.1671431Z compiled: bool, 2025-05-07T20:32:37.1671658Z ) -> None: 2025-05-07T20:32:37.1671882Z torch.manual_seed(2025) 2025-05-07T20:32:37.1672128Z 2025-05-07T20:32:37.1672400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1672749Z 2025-05-07T20:32:37.1672950Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1673243Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1673557Z x = x_sign * x_clamp 2025-05-07T20:32:37.1674079Z x0 = x[:, :D] 2025-05-07T20:32:37.1674322Z x1 = x[:, D:] 2025-05-07T20:32:37.1674540Z 2025-05-07T20:32:37.1674746Z if contiguous: 2025-05-07T20:32:37.1674997Z x0 = x0.contiguous() 2025-05-07T20:32:37.1675279Z x1 = x1.contiguous() 2025-05-07T20:32:37.1675544Z 2025-05-07T20:32:37.1675753Z if scale_ub is not None: 2025-05-07T20:32:37.1676050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1676426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1676777Z ) 2025-05-07T20:32:37.1676975Z else: 2025-05-07T20:32:37.1677201Z scale_ub_tensor = None 2025-05-07T20:32:37.1677479Z 2025-05-07T20:32:37.1677726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1678080Z op = silu_mul_quant 2025-05-07T20:32:37.1678356Z if compiled: 2025-05-07T20:32:37.1678699Z op = torch.compile(op) 2025-05-07T20:32:37.1679093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1679431Z 2025-05-07T20:32:37.1679631Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1679801Z 2025-05-07T20:32:37.1679902Z moe/activation_test.py:117: 2025-05-07T20:32:37.1680290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1680672Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1680979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1681897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1682742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1683385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1684205Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1685014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1685658Z kernel = self.compile( 2025-05-07T20:32:37.1686380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1687179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1687646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1687918Z 2025-05-07T20:32:37.1688164Z self = 2025-05-07T20:32:37.1689561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1691327Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acabdea0>} 2025-05-07T20:32:37.1693035Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1694318Z context = 2025-05-07T20:32:37.1694663Z 2025-05-07T20:32:37.1694859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1695477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1696038Z module_map=module_map) 2025-05-07T20:32:37.1696453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1696850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1697142Z E ^ 2025-05-07T20:32:37.1697742Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1698359Z 2025-05-07T20:32:37.1698788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1699312Z 2025-05-07T20:32:37.1699420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1699841Z self=, 2025-05-07T20:32:37.1700249Z T=2048, 2025-05-07T20:32:37.1700441Z D=7168, 2025-05-07T20:32:37.1700638Z scale_ub=None, 2025-05-07T20:32:37.1700859Z contiguous=False, 2025-05-07T20:32:37.1701086Z compiled=False, 2025-05-07T20:32:37.1701302Z ) 2025-05-07T20:32:37.1701628Z self = 2025-05-07T20:32:37.1702133Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1702407Z 2025-05-07T20:32:37.1702487Z @given( 2025-05-07T20:32:37.1702728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1703047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1703359Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1703698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1704081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1704367Z ) 2025-05-07T20:32:37.1704720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1705204Z def test_silu_mul_quant( 2025-05-07T20:32:37.1705444Z self, 2025-05-07T20:32:37.1705641Z T: int, 2025-05-07T20:32:37.1705842Z D: int, 2025-05-07T20:32:37.1706067Z scale_ub: Optional[float], 2025-05-07T20:32:37.1706338Z contiguous: bool, 2025-05-07T20:32:37.1706582Z compiled: bool, 2025-05-07T20:32:37.1706808Z ) -> None: 2025-05-07T20:32:37.1707023Z torch.manual_seed(2025) 2025-05-07T20:32:37.1707271Z 2025-05-07T20:32:37.1707551Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1709658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1711620Z 2025-05-07T20:32:37.1711743Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1711964Z 2025-05-07T20:32:37.1712069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1712493Z self=, 2025-05-07T20:32:37.1712903Z T=128, 2025-05-07T20:32:37.1713090Z D=7168, 2025-05-07T20:32:37.1713286Z scale_ub=1200.0, 2025-05-07T20:32:37.1713513Z contiguous=True, 2025-05-07T20:32:37.1713739Z compiled=True, 2025-05-07T20:32:37.1713946Z ) 2025-05-07T20:32:37.2133967Z self = 2025-05-07T20:32:37.2134523Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.2134795Z 2025-05-07T20:32:37.2134883Z @given( 2025-05-07T20:32:37.2135120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2135444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2135758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2136087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2136424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2136713Z ) 2025-05-07T20:32:37.2137255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2137702Z def test_silu_mul_quant( 2025-05-07T20:32:37.2137950Z self, 2025-05-07T20:32:37.2138269Z T: int, 2025-05-07T20:32:37.2138466Z D: int, 2025-05-07T20:32:37.2138693Z scale_ub: Optional[float], 2025-05-07T20:32:37.2138974Z contiguous: bool, 2025-05-07T20:32:37.2139214Z compiled: bool, 2025-05-07T20:32:37.2139445Z ) -> None: 2025-05-07T20:32:37.2139668Z torch.manual_seed(2025) 2025-05-07T20:32:37.2139914Z 2025-05-07T20:32:37.2140194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2140544Z 2025-05-07T20:32:37.2140739Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2141038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2141352Z x = x_sign * x_clamp 2025-05-07T20:32:37.2141592Z x0 = x[:, :D] 2025-05-07T20:32:37.2141815Z x1 = x[:, D:] 2025-05-07T20:32:37.2142033Z 2025-05-07T20:32:37.2142225Z if contiguous: 2025-05-07T20:32:37.2142460Z x0 = x0.contiguous() 2025-05-07T20:32:37.2142725Z x1 = x1.contiguous() 2025-05-07T20:32:37.2142972Z 2025-05-07T20:32:37.2143164Z if scale_ub is not None: 2025-05-07T20:32:37.2143581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.2152154Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.2152471Z ) 2025-05-07T20:32:37.2152676Z else: 2025-05-07T20:32:37.2153037Z scale_ub_tensor = None 2025-05-07T20:32:37.2153287Z 2025-05-07T20:32:37.2153530Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.2153850Z op = silu_mul_quant 2025-05-07T20:32:37.2154105Z if compiled: 2025-05-07T20:32:37.2154352Z op = torch.compile(op) 2025-05-07T20:32:37.2154652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.2154928Z 2025-05-07T20:32:37.2155122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.2155296Z 2025-05-07T20:32:37.2155396Z moe/activation_test.py:117: 2025-05-07T20:32:37.2156097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.2156547Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.2156845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.2157423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.2158005Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.2158726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.2159435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.2159986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.2160678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.2161355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.2161902Z kernel = self.compile( 2025-05-07T20:32:37.2162459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.2163124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.2163526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.2163755Z 2025-05-07T20:32:37.2163976Z self = 2025-05-07T20:32:37.2165083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.2166578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acabf7f0>} 2025-05-07T20:32:37.2167958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.2169054Z context = 2025-05-07T20:32:37.2169350Z 2025-05-07T20:32:37.2169527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.2170052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.2170533Z module_map=module_map) 2025-05-07T20:32:37.2170910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.2171275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.2171542Z E ^ 2025-05-07T20:32:37.2172019Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.2172479Z 2025-05-07T20:32:37.2172981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.2173501Z 2025-05-07T20:32:37.2173608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2174031Z self=, 2025-05-07T20:32:37.2174497Z T=128, 2025-05-07T20:32:37.2174691Z D=7168, 2025-05-07T20:32:37.2174888Z scale_ub=1200.0, 2025-05-07T20:32:37.2175116Z contiguous=True, 2025-05-07T20:32:37.2175345Z compiled=False, 2025-05-07T20:32:37.2175549Z ) 2025-05-07T20:32:37.2175878Z self = 2025-05-07T20:32:37.2176378Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.2176653Z 2025-05-07T20:32:37.2176730Z @given( 2025-05-07T20:32:37.2176966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2177331Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2177638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2178148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2178539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2178830Z ) 2025-05-07T20:32:37.2179187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2179636Z def test_silu_mul_quant( 2025-05-07T20:32:37.2179874Z self, 2025-05-07T20:32:37.2180075Z T: int, 2025-05-07T20:32:37.2180273Z D: int, 2025-05-07T20:32:37.2180487Z scale_ub: Optional[float], 2025-05-07T20:32:37.2180764Z contiguous: bool, 2025-05-07T20:32:37.2181007Z compiled: bool, 2025-05-07T20:32:37.2181227Z ) -> None: 2025-05-07T20:32:37.2181450Z torch.manual_seed(2025) 2025-05-07T20:32:37.2181695Z 2025-05-07T20:32:37.2181969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2182316Z 2025-05-07T20:32:37.2182513Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2182805Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2184853Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.2186755Z 2025-05-07T20:32:37.2186943Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.2187167Z 2025-05-07T20:32:37.2187274Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2187699Z self=, 2025-05-07T20:32:37.2188102Z T=128, 2025-05-07T20:32:37.2188299Z D=5120, 2025-05-07T20:32:37.2188498Z scale_ub=1200.0, 2025-05-07T20:32:37.2188749Z contiguous=True, 2025-05-07T20:32:37.2189074Z compiled=True, 2025-05-07T20:32:37.2189357Z ) 2025-05-07T20:32:37.2189793Z self = 2025-05-07T20:32:37.2190400Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.2190677Z 2025-05-07T20:32:37.2190754Z @given( 2025-05-07T20:32:37.2190986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.2191294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.2191603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.2191944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.2192273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.2192564Z ) 2025-05-07T20:32:37.2192918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.2193436Z def test_silu_mul_quant( 2025-05-07T20:32:37.2193676Z self, 2025-05-07T20:32:37.2193870Z T: int, 2025-05-07T20:32:37.2194066Z D: int, 2025-05-07T20:32:37.2194279Z scale_ub: Optional[float], 2025-05-07T20:32:37.2194599Z contiguous: bool, 2025-05-07T20:32:37.2194842Z compiled: bool, 2025-05-07T20:32:37.2195062Z ) -> None: 2025-05-07T20:32:37.2195280Z torch.manual_seed(2025) 2025-05-07T20:32:37.2195522Z 2025-05-07T20:32:37.2195794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.2196135Z 2025-05-07T20:32:37.2196329Z x_sign = torch.sign(x) 2025-05-07T20:32:37.2196624Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.2198670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.2200616Z 2025-05-07T20:32:37.2200736Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.2200955Z 2025-05-07T20:32:37.2201060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.2201479Z self=, 2025-05-07T20:32:37.2201879Z T=128, 2025-05-07T20:32:37.2202073Z D=7168, 2025-05-07T20:32:37.2202267Z scale_ub=None, 2025-05-07T20:32:37.2202476Z contiguous=True, 2025-05-07T20:32:37.2202699Z compiled=True, 2025-05-07T20:32:37.2202906Z ) 2025-05-07T20:32:37.4175937Z self = 2025-05-07T20:32:37.4176499Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.4176777Z 2025-05-07T20:32:37.4176861Z @given( 2025-05-07T20:32:37.4177109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4177433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4177750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4178205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4178591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4178887Z ) 2025-05-07T20:32:37.4179250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4179979Z def test_silu_mul_quant( 2025-05-07T20:32:37.4180238Z self, 2025-05-07T20:32:37.4180447Z T: int, 2025-05-07T20:32:37.4180653Z D: int, 2025-05-07T20:32:37.4180887Z scale_ub: Optional[float], 2025-05-07T20:32:37.4181174Z contiguous: bool, 2025-05-07T20:32:37.4181422Z compiled: bool, 2025-05-07T20:32:37.4181652Z ) -> None: 2025-05-07T20:32:37.4181880Z torch.manual_seed(2025) 2025-05-07T20:32:37.4182132Z 2025-05-07T20:32:37.4182411Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4184542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4186477Z 2025-05-07T20:32:37.4186601Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.4186902Z 2025-05-07T20:32:37.4196674Z FAILED 2025-05-07T20:32:37.4196966Z 2025-05-07T20:32:37.4197481Z =================================== FAILURES =================================== 2025-05-07T20:32:37.4198149Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:37.4199103Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:37.4199994Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:37.4200762Z | yield 2025-05-07T20:32:37.4201396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run 2025-05-07T20:32:37.4202134Z | self._callTestMethod(testMethod) 2025-05-07T20:32:37.4202957Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod 2025-05-07T20:32:37.4203842Z | method() 2025-05-07T20:32:37.4204750Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:37.4205704Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4206364Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:37.4207167Z | raise the_error_hypothesis_found 2025-05-07T20:32:37.4207671Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:37.4208172Z +-+---------------- 1 ---------------- 2025-05-07T20:32:37.4208612Z | Traceback (most recent call last): 2025-05-07T20:32:37.4209618Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4210709Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4213629Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4216465Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4217172Z | self=, 2025-05-07T20:32:37.4217759Z | T=2048, 2025-05-07T20:32:37.4218210Z | D=5120, # or any other generated value 2025-05-07T20:32:37.4218749Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.4219276Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.4219794Z | compiled=False, # or any other generated value 2025-05-07T20:32:37.4220256Z | ) 2025-05-07T20:32:37.4220515Z | 2025-05-07T20:32:37.4221254Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:37.4222097Z +---------------- 2 ---------------- 2025-05-07T20:32:37.4222516Z | Traceback (most recent call last): 2025-05-07T20:32:37.4223549Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4224647Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4227582Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4230505Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4231129Z | self=, 2025-05-07T20:32:37.4231692Z | T=128, 2025-05-07T20:32:37.4231978Z | D=7168, 2025-05-07T20:32:37.4232270Z | scale_ub=None, 2025-05-07T20:32:37.4232623Z | contiguous=True, 2025-05-07T20:32:37.4232965Z | compiled=True, 2025-05-07T20:32:37.4233279Z | ) 2025-05-07T20:32:37.4233640Z | 2025-05-07T20:32:37.4234390Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4235226Z +---------------- 3 ---------------- 2025-05-07T20:32:37.4235637Z | Traceback (most recent call last): 2025-05-07T20:32:37.4236692Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.4237786Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4240405Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.4242431Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4242877Z | self=, 2025-05-07T20:32:37.4243341Z | T=128, 2025-05-07T20:32:37.4243618Z | D=5120, 2025-05-07T20:32:37.4243925Z | scale_ub=1200.0, 2025-05-07T20:32:37.4244282Z | contiguous=True, 2025-05-07T20:32:37.4244646Z | compiled=True, 2025-05-07T20:32:37.4244972Z | ) 2025-05-07T20:32:37.4245236Z | 2025-05-07T20:32:37.4246092Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4246995Z +---------------- 4 ---------------- 2025-05-07T20:32:37.4247434Z | Traceback (most recent call last): 2025-05-07T20:32:37.4248556Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:37.4249638Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.4250317Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:37.4251026Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4251877Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:37.4252684Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4253333Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:37.4254407Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4255979Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:37.4257329Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4258790Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:37.4261009Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4262188Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:37.4263224Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4264200Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:37.4265251Z | fn() 2025-05-07T20:32:37.4266082Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:37.4266995Z | self.fn.run( 2025-05-07T20:32:37.4267757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:37.4268607Z | kernel = self.compile( 2025-05-07T20:32:37.4269486Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:37.4270512Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4271545Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4272701Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4273465Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4273980Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4274364Z | ^ 2025-05-07T20:32:37.4275036Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4275872Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.4276487Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:37.4277264Z | self=, 2025-05-07T20:32:37.4277914Z | T=1, # or any other generated value 2025-05-07T20:32:37.4278529Z | D=5120, # or any other generated value 2025-05-07T20:32:37.4279037Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.4279562Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.4280100Z | compiled=True, # or any other generated value 2025-05-07T20:32:37.4280545Z | ) 2025-05-07T20:32:37.4280809Z | 2025-05-07T20:32:37.4281551Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.4282444Z +------------------------------------ 2025-05-07T20:32:37.4282964Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:37.4283498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4284106Z self=, 2025-05-07T20:32:37.4284689Z T=1, 2025-05-07T20:32:37.4284970Z D=5120, 2025-05-07T20:32:37.4285259Z scale_ub=None, 2025-05-07T20:32:37.4285583Z contiguous=True, 2025-05-07T20:32:37.4285903Z compiled=True, 2025-05-07T20:32:37.4286213Z ) 2025-05-07T20:32:37.4286681Z self = 2025-05-07T20:32:37.4287452Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.4287849Z 2025-05-07T20:32:37.4287966Z @given( 2025-05-07T20:32:37.4288317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4288808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4289316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4289813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4290304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4290720Z ) 2025-05-07T20:32:37.4291238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4291893Z def test_silu_mul_quant( 2025-05-07T20:32:37.4292258Z self, 2025-05-07T20:32:37.4292539Z T: int, 2025-05-07T20:32:37.4292836Z D: int, 2025-05-07T20:32:37.4293160Z scale_ub: Optional[float], 2025-05-07T20:32:37.4293619Z contiguous: bool, 2025-05-07T20:32:37.4293979Z compiled: bool, 2025-05-07T20:32:37.4294323Z ) -> None: 2025-05-07T20:32:37.4294638Z torch.manual_seed(2025) 2025-05-07T20:32:37.4295000Z 2025-05-07T20:32:37.4295402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4295878Z 2025-05-07T20:32:37.4296146Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4296573Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4297015Z x = x_sign * x_clamp 2025-05-07T20:32:37.4297357Z x0 = x[:, :D] 2025-05-07T20:32:37.4297672Z x1 = x[:, D:] 2025-05-07T20:32:37.4297972Z 2025-05-07T20:32:37.4298363Z if contiguous: 2025-05-07T20:32:37.4298706Z x0 = x0.contiguous() 
2025-05-07T20:32:37.4299084Z x1 = x1.contiguous() 2025-05-07T20:32:37.4299440Z 2025-05-07T20:32:37.4299729Z if scale_ub is not None: 2025-05-07T20:32:37.4300156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4300653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4301123Z ) 2025-05-07T20:32:37.4301414Z else: 2025-05-07T20:32:37.4301725Z scale_ub_tensor = None 2025-05-07T20:32:37.4302097Z 2025-05-07T20:32:37.4302437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4302894Z op = silu_mul_quant 2025-05-07T20:32:37.4303261Z if compiled: 2025-05-07T20:32:37.4303622Z op = torch.compile(op) 2025-05-07T20:32:37.4304044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4304443Z 2025-05-07T20:32:37.4304730Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4305147Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4305620Z 2025-05-07T20:32:37.4305944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4306392Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4306797Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4307235Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4307729Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4308152Z 2025-05-07T20:32:37.4308463Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.4308766Z 2025-05-07T20:32:37.4308914Z moe/activation_test.py:126: 2025-05-07T20:32:37.4309315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4309790Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4310290Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4331903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4332982Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4333781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4334864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4335879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4336985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4338206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4339310Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4340352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4341266Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4342093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4342864Z fn() 2025-05-07T20:32:37.4343583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4344394Z self.fn.run( 2025-05-07T20:32:37.4345071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4345852Z kernel = self.compile( 2025-05-07T20:32:37.4346637Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4347585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4348161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4348494Z 2025-05-07T20:32:37.4348789Z self = 2025-05-07T20:32:37.4350334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4352316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89ab60af0>} 2025-05-07T20:32:37.4354233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4356053Z context = 2025-05-07T20:32:37.4356475Z 2025-05-07T20:32:37.4356840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4357612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4358285Z module_map=module_map) 2025-05-07T20:32:37.4358789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4359285Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.4359670Z E ^ 2025-05-07T20:32:37.4360326Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4360970Z 2025-05-07T20:32:37.4361553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4362301Z 2025-05-07T20:32:37.4362451Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4363058Z self=, 2025-05-07T20:32:37.4363637Z T=2048, 2025-05-07T20:32:37.4363906Z D=5120, 2025-05-07T20:32:37.4364189Z scale_ub=1200.0, 2025-05-07T20:32:37.4364513Z contiguous=True, 2025-05-07T20:32:37.4364839Z compiled=False, 2025-05-07T20:32:37.4365142Z ) 2025-05-07T20:32:37.4365674Z self = 2025-05-07T20:32:37.4366353Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.4366741Z 2025-05-07T20:32:37.4366847Z @given( 2025-05-07T20:32:37.4367254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4367692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4368128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4368592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4369056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4369451Z ) 2025-05-07T20:32:37.4369946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4370569Z def test_silu_mul_quant( 2025-05-07T20:32:37.4370904Z self, 2025-05-07T20:32:37.4371184Z T: int, 2025-05-07T20:32:37.4371534Z D: int, 2025-05-07T20:32:37.4371822Z scale_ub: Optional[float], 2025-05-07T20:32:37.4372202Z contiguous: bool, 2025-05-07T20:32:37.4372531Z compiled: bool, 2025-05-07T20:32:37.4372844Z ) -> None: 2025-05-07T20:32:37.4373143Z torch.manual_seed(2025) 2025-05-07T20:32:37.4373473Z 2025-05-07T20:32:37.4373867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4374336Z 2025-05-07T20:32:37.4374603Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4375001Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4375437Z x = x_sign * x_clamp 2025-05-07T20:32:37.4375777Z x0 = x[:, :D] 
2025-05-07T20:32:37.4376068Z x1 = x[:, D:] 2025-05-07T20:32:37.4376364Z 2025-05-07T20:32:37.4376644Z if contiguous: 2025-05-07T20:32:37.4376972Z x0 = x0.contiguous() 2025-05-07T20:32:37.4377333Z x1 = x1.contiguous() 2025-05-07T20:32:37.4377679Z 2025-05-07T20:32:37.4377940Z if scale_ub is not None: 2025-05-07T20:32:37.4378451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4378987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4379419Z ) 2025-05-07T20:32:37.4379696Z else: 2025-05-07T20:32:37.4380001Z scale_ub_tensor = None 2025-05-07T20:32:37.4380361Z 2025-05-07T20:32:37.4380704Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4381149Z op = silu_mul_quant 2025-05-07T20:32:37.4381505Z if compiled: 2025-05-07T20:32:37.4381858Z op = torch.compile(op) 2025-05-07T20:32:37.4382274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4382658Z 2025-05-07T20:32:37.4382980Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4383214Z 2025-05-07T20:32:37.4383357Z moe/activation_test.py:117: 2025-05-07T20:32:37.4383765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4384218Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4384636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4385641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4386624Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4387378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4388345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4389292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4390042Z kernel = self.compile( 2025-05-07T20:32:37.4390825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4391759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4392396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4392718Z 2025-05-07T20:32:37.4393010Z self = 2025-05-07T20:32:37.4394504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4396470Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89aa39990>} 2025-05-07T20:32:37.4398473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4400026Z context = 2025-05-07T20:32:37.4400434Z 2025-05-07T20:32:37.4400649Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4401346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4401955Z module_map=module_map) 2025-05-07T20:32:37.4402420Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4402931Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4403282Z E ^ 2025-05-07T20:32:37.4403950Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.4405202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.4406088Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.4448845Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
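Every example so far dies at the same spot: Triton refuses the fp8e4nv destination dtype while lowering the kernel, before anything runs. fp8e4nv corresponds to the e4m3 format that Triton, to my understanding, only emits on NVIDIA parts of compute capability 8.9 or newer; on older SM 8.x GPUs it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A minimal sketch of a guard such a test could use to skip cleanly on unsupported hardware, using only stock torch APIs; the helper name and decorator placement are illustrative, not taken from this test:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv maps to torch.float8_e4m3fn and requires
    # an NVIDIA GPU of compute capability (8, 9) or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement on a test like the one failing here:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...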
2025-05-07T20:32:37.4480862Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
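For orientation, the three names in the error are Triton dtype strings, not torch dtypes. The mapping below is my reading of the two naming schemes and should be treated as an assumption; the cast at the end shows that plain PyTorch fp8 casts bypass Triton entirely:

import torch

# Assumed correspondence between Triton fp8 names and torch dtypes:
#   fp8e4nv  -> torch.float8_e4m3fn  (finite-only e4m3; the dtype rejected here)
#   fp8e5    -> torch.float8_e5m2    (listed as supported on this GPU)
#   fp8e4b15 -> e4m3 with exponent bias 15; no stock torch equivalent

t = torch.randn(4, 4, device="cuda")
t_e5m2 = t.to(torch.float8_e5m2)  # a plain cast; no Triton compilation involved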
2025-05-07T20:32:37.4529749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
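The reference path fails the same way because triton_quantize_fp8_row also launches a Triton kernel (_kernel_quantize_fp8_row). A plain-PyTorch stand-in for the rowwise scheme it appears to implement, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None]; the epsilon and the exact scale_ub semantics are assumptions, not the FBGEMM implementation:

import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Rowwise FP8 quantization sketch: scale each row so its max maps to the
    # fp8 max, store fp8 values plus one fp32 dequant scale per row.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # assumed epsilon, avoids a zero scale
    scale = row_max / fp8_max
    y_fp8 = (y.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)  # y ~= y_fp8.to(torch.float32) * scale[:, None]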
2025-05-07T20:32:37.4561585Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
2025-05-07T20:32:37.4592646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
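The eager math under test is small enough to restate: ref_fn computes SiLU gating in fp32, silu(x0) * x1 with silu(x) = x * sigmoid(x), and torch.nn.functional.silu expresses the same product:

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Identical to ref_fn's x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32.
    return F.silu(x0.float()) * x1.float()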
2025-05-07T20:32:37.4620782Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
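Because the error fires while the kernel is being specialized, it should reproduce with one tiny input and no Hypothesis at all. A repro sketch; the import path is read off the traceback, and whether silu_mul_quant is also re-exported at a shorter path is not visible from this log:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

D = 5120
x = torch.randn([1, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # raises CompilationError on SM < 8.9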
2025-05-07T20:32:37.4633575Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn while compiling _fbgemm_silu_mul_quant: same fp8e4nv CompilationError
2025-05-07T20:32:37.4646565Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
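Two patterns are visible across the draws summarized above. The failure site tracks the compiled flag: with compiled=False the test dies at line 117 inside fn(), while with compiled=True fn() gets through and the uncompiled reference path at line 126 fails instead. And since @given samples from a fixed grid, every retry re-proves the same architecture limitation on an equivalent point:

from itertools import product

# The fixed search space behind @given; every draw in this log is one of these.
grid = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.00],              # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
assert len(grid) == 80  # Hypothesis draws at most _MAX_SAMPLES of them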
2025-05-07T20:32:37.4676093Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at moe/activation_test.py:126 in ref_fn while compiling _kernel_quantize_fp8_row: same fp8e4nv CompilationError
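Had compilation succeeded, the comparison would be between dequantized fp32 values; the source shows the dequant for y, and symmetric treatment of the reference pair is the natural reading. A sketch of that final check with stand-in tensors; the tolerances are placeholders, since the actual assertion sits past line 126 and is not visible in this log:

import torch

# Stand-in tensors; in the test these come from fn() and ref_fn().
y_fp8 = torch.randn(4, 8).to(torch.float8_e4m3fn)
y_scale = torch.ones(4)
y_fp8_ref, y_scale_ref = y_fp8.clone(), y_scale.clone()

y = y_fp8.to(torch.float32) * y_scale[:, None]
y_ref = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
torch.testing.assert_close(y, y_ref, atol=1e-1, rtol=1e-1)  # placeholder tolerances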
2025-05-07T20:32:37.4693403Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.4693625Z self=,
2025-05-07T20:32:37.4693708Z T=128,
2025-05-07T20:32:37.4693785Z D=5120,
2025-05-07T20:32:37.4693867Z scale_ub=None,
2025-05-07T20:32:37.4693957Z contiguous=True,
2025-05-07T20:32:37.4694040Z compiled=True,
2025-05-07T20:32:37.4694113Z )
2025-05-07T20:32:37.4694338Z self = 
2025-05-07T20:32:37.4694546Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.4694632Z @given(
2025-05-07T20:32:37.4694753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.4694855Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.4694977Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.4695095Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.4695210Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.4695289Z )
2025-05-07T20:32:37.4695539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.4695633Z def test_silu_mul_quant(
2025-05-07T20:32:37.4695713Z self,
2025-05-07T20:32:37.4695790Z T: int,
2025-05-07T20:32:37.4695864Z D: int,
2025-05-07T20:32:37.4695969Z scale_ub: Optional[float],
2025-05-07T20:32:37.4696061Z contiguous: bool,
2025-05-07T20:32:37.4696151Z compiled: bool,
2025-05-07T20:32:37.4696232Z ) -> None:
2025-05-07T20:32:37.4696328Z torch.manual_seed(2025)
2025-05-07T20:32:37.4696572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.4696744Z x_sign = torch.sign(x)
2025-05-07T20:32:37.4696912Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.4697003Z x = x_sign * x_clamp
2025-05-07T20:32:37.4697087Z x0 = x[:, :D]
2025-05-07T20:32:37.4697167Z x1 = x[:, D:]
2025-05-07T20:32:37.4697367Z if contiguous:
2025-05-07T20:32:37.4697459Z x0 = x0.contiguous()
2025-05-07T20:32:37.4697553Z x1 = x1.contiguous()
2025-05-07T20:32:37.4697719Z if scale_ub is not None:
2025-05-07T20:32:37.4697829Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.4697963Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.4698118Z )
2025-05-07T20:32:37.4698200Z else:
2025-05-07T20:32:37.4698298Z scale_ub_tensor = None
2025-05-07T20:32:37.4698591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.4698693Z             op = silu_mul_quant
2025-05-07T20:32:37.4698777Z             if compiled:
2025-05-07T20:32:37.4698879Z                 op = torch.compile(op)
2025-05-07T20:32:37.4698985Z             return op(x0, x1, scale_ub_tensor)
[remainder of this example's output elided: the failure at `> y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) and the traceback through triton_quantize_fp8_row (fp8_gemm.py:2370) into _kernel_quantize_fp8_row are identical to the block above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:37.4710251Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
2025-05-07T20:32:37.4726975Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
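[editor's note: every example in this run dies at Triton compile time, before any numerics execute. fp8e4nv is Triton's name for FP8 E4M3, and Triton's CUDA backend only lowers that dtype on GPUs with compute capability >= 8.9 (Ada/Hopper); on older parts it raises exactly this ValueError, listing only ('fp8e4b15', 'fp8e5') as supported. The linux.g5.4xlarge runner carries an NVIDIA A10G, which is SM 8.6, so both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant are uncompilable on this machine. A capability gate would turn these into skips instead of failures -- a minimal sketch of one possible guard, not the repository's actual fix:]

    import unittest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA GPUs with
        # compute capability >= 8.9; the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement on the test shown above:
    #   @unittest.skipUnless(device_supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    #   def test_silu_mul_quant(...): ...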
2025-05-07T20:32:37.4743796Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[this example fails earlier, at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), through torch/_dynamo/eval_frame.py:678 into silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
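[editor's note: for readers skimming the repeated listings, the computation under test is small. A standalone sketch of the reference path exercised by ref_fn above, using the same fbgemm_gpu import the test uses (on this runner the quantize call itself trips the same fp8e4nv error):]

    from typing import Optional, Tuple
    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    def silu_mul_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then rowwise FP8 quantization, as in ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        return triton_quantize_fp8_row(y, scale_ub)

    # The test then dequantizes both sides for comparison:
    #   y = y_fp8.to(torch.float32) * y_scale[:, None]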
2025-05-07T20:32:37.4757454Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure: ref_fn at moe/activation_test.py:126 -> _kernel_quantize_fp8_row -> same CompilationError]
2025-05-07T20:32:37.4774411Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[identical failure in fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant -> same CompilationError]
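[editor's note: since both failing kernels are rowwise FP8 quantizers, it may help to spell out the computation they implement. The following eager-mode equivalent is an illustration under stated assumptions -- per-row scale = max(|row|)/FP8_MAX, optionally clamped by scale_ub -- not FBGEMM's actual _kernel_quantize_fp8_row:]

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_eager(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max maps to FP8_E4M3_MAX.
        row_max = x.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
        x_fp8 = (x.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale  # dequantize with x_fp8.float() * scale[:, None]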
2025-05-07T20:32:37.4787160Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[identical failure in fn() at moe/activation_test.py:117, through torch/_dynamo/eval_frame.py:678 -> silu_mul_quant -> _fbgemm_silu_mul_quant -> same CompilationError]
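[editor's note: to replay one failing case deterministically without the Hypothesis search loop, the drawn parameters can be inlined. A sketch reproducing the T=1, scale_ub=1200.0 example, with the import path taken from the traceback:]

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError
    # (fp8e4nv unsupported) at kernel-compile time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)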
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4803733Z 2025-05-07T20:32:37.4804151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4804163Z 2025-05-07T20:32:37.4804266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4804490Z self=, 2025-05-07T20:32:37.4804570Z T=128, 2025-05-07T20:32:37.4804643Z D=7168, 2025-05-07T20:32:37.4804724Z scale_ub=1200.0, 2025-05-07T20:32:37.4804814Z contiguous=False, 2025-05-07T20:32:37.4804896Z compiled=False, 2025-05-07T20:32:37.4804967Z ) 2025-05-07T20:32:37.4805188Z self = 2025-05-07T20:32:37.4805360Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.4805365Z 2025-05-07T20:32:37.4805442Z @given( 2025-05-07T20:32:37.4805601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4805700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4805817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4805937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4806054Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4806129Z ) 2025-05-07T20:32:37.4806375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4806468Z def test_silu_mul_quant( 2025-05-07T20:32:37.4806551Z self, 2025-05-07T20:32:37.4806625Z T: int, 2025-05-07T20:32:37.4806699Z D: int, 2025-05-07T20:32:37.4806798Z scale_ub: Optional[float], 2025-05-07T20:32:37.4806884Z contiguous: bool, 2025-05-07T20:32:37.4806968Z compiled: bool, 2025-05-07T20:32:37.4807044Z ) -> None: 2025-05-07T20:32:37.4807136Z torch.manual_seed(2025) 2025-05-07T20:32:37.4807208Z 2025-05-07T20:32:37.4807379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4807455Z 2025-05-07T20:32:37.4807549Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4807676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4807762Z x = x_sign * x_clamp 2025-05-07T20:32:37.4807889Z x0 = x[:, :D] 2025-05-07T20:32:37.4807967Z x1 = x[:, D:] 2025-05-07T20:32:37.4808037Z 2025-05-07T20:32:37.4808121Z if contiguous: 2025-05-07T20:32:37.4808214Z x0 = x0.contiguous() 2025-05-07T20:32:37.4808350Z x1 = x1.contiguous() 2025-05-07T20:32:37.4808419Z 2025-05-07T20:32:37.4808507Z if scale_ub is not None: 2025-05-07T20:32:37.4808618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4808752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4808825Z ) 2025-05-07T20:32:37.4808905Z else: 2025-05-07T20:32:37.4808996Z scale_ub_tensor = None 2025-05-07T20:32:37.4809074Z 2025-05-07T20:32:37.4809209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4809296Z op = silu_mul_quant 2025-05-07T20:32:37.4809421Z if compiled: 2025-05-07T20:32:37.4809528Z op = torch.compile(op) 2025-05-07T20:32:37.4809634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4809711Z 2025-05-07T20:32:37.4809799Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4809803Z 2025-05-07T20:32:37.4809899Z moe/activation_test.py:117: 2025-05-07T20:32:37.4810029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4810128Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4810226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4810737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4810832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4811201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4811422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4811768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4811866Z kernel = self.compile( 2025-05-07T20:32:37.4812252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4812428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4812554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4812558Z 2025-05-07T20:32:37.4812763Z self = 2025-05-07T20:32:37.4813591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4814103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa89806e200>} 2025-05-07T20:32:37.4814863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4815062Z context = 2025-05-07T20:32:37.4815067Z 2025-05-07T20:32:37.4815231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4815499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4815606Z module_map=module_map) 2025-05-07T20:32:37.4815766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4815868Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4815946Z E ^ 2025-05-07T20:32:37.4816350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4816355Z 2025-05-07T20:32:37.4816771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4816839Z 2025-05-07T20:32:37.4816943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4817169Z self=, 2025-05-07T20:32:37.4817243Z T=128, 2025-05-07T20:32:37.4817321Z D=5120, 2025-05-07T20:32:37.4817401Z scale_ub=None, 2025-05-07T20:32:37.4817486Z contiguous=False, 2025-05-07T20:32:37.4817568Z compiled=False, 2025-05-07T20:32:37.4817643Z ) 2025-05-07T20:32:37.4817864Z self = 2025-05-07T20:32:37.4818147Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.4818200Z 2025-05-07T20:32:37.4818281Z @given( 2025-05-07T20:32:37.4818414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4818526Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4818668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4818787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4818905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4818976Z ) 2025-05-07T20:32:37.4819221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4819317Z def test_silu_mul_quant( 2025-05-07T20:32:37.4819390Z self, 2025-05-07T20:32:37.4819464Z T: int, 2025-05-07T20:32:37.4819541Z D: int, 2025-05-07T20:32:37.4819639Z scale_ub: Optional[float], 2025-05-07T20:32:37.4819727Z contiguous: bool, 2025-05-07T20:32:37.4819811Z compiled: bool, 2025-05-07T20:32:37.4819890Z ) -> None: 2025-05-07T20:32:37.4819982Z torch.manual_seed(2025) 2025-05-07T20:32:37.4820055Z 2025-05-07T20:32:37.4820226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4820297Z 2025-05-07T20:32:37.4820392Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4820514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4820607Z x = x_sign * x_clamp 2025-05-07T20:32:37.4820683Z x0 = x[:, :D] 2025-05-07T20:32:37.4820759Z x1 = x[:, D:] 2025-05-07T20:32:37.4820834Z 2025-05-07T20:32:37.4820914Z if contiguous: 2025-05-07T20:32:37.4821002Z x0 = x0.contiguous() 2025-05-07T20:32:37.4821091Z x1 = x1.contiguous() 2025-05-07T20:32:37.4821163Z 2025-05-07T20:32:37.4821297Z if scale_ub is not None: 2025-05-07T20:32:37.4821405Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4821539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4821614Z ) 2025-05-07T20:32:37.4821692Z else: 2025-05-07T20:32:37.4821787Z scale_ub_tensor = None 2025-05-07T20:32:37.4821859Z 2025-05-07T20:32:37.4821989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4822076Z op = silu_mul_quant 2025-05-07T20:32:37.4822162Z if compiled: 2025-05-07T20:32:37.4822259Z op = torch.compile(op) 2025-05-07T20:32:37.4822362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4822436Z 2025-05-07T20:32:37.4822524Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4822528Z 2025-05-07T20:32:37.4822623Z moe/activation_test.py:117: 2025-05-07T20:32:37.4822750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4822851Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4822954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4823456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4823598Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4823963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4824184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4824569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4824668Z kernel = self.compile( 2025-05-07T20:32:37.4825051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4825228Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4825352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4825357Z 2025-05-07T20:32:37.4825604Z self = 2025-05-07T20:32:37.4826393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4826903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa898626a70>} 2025-05-07T20:32:37.4827665Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4827859Z context = 2025-05-07T20:32:37.4827863Z 2025-05-07T20:32:37.4828030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4828295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4828403Z module_map=module_map) 2025-05-07T20:32:37.4828580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4828690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4828780Z E ^ 2025-05-07T20:32:37.4829150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
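Every example above fails for the same reason: Triton's fp8e4nv type (FP8 E4M3) is only available on NVIDIA GPUs with compute capability 8.9 or newer, while this job runs on a linux.g5.4xlarge runner (A10G, sm_86), where the backend only offers fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these examples on unsupported hardware is below; it assumes a unittest-style test class like the one in moe/activation_test.py, and the helper and class names are illustrative, not the repository's actual code.

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs sm_89+ (Ada/Hopper); the A10G on this runner is sm_86.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Hypothetical guard: skip the FP8 activation tests on pre-sm_89 GPUs.
@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class ActivationFP8Tests(unittest.TestCase):
    ...

With such a guard, Hypothesis would never reach the Triton compile step on this runner, and the job would report a skip instead of repeating the same CompilationError for every sampled parameter tuple.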
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4868603Z 2025-05-07T20:32:37.4869020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4869033Z 2025-05-07T20:32:37.4869137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4869361Z self=, 2025-05-07T20:32:37.4869441Z T=1, 2025-05-07T20:32:37.4869518Z D=7168, 2025-05-07T20:32:37.4869600Z scale_ub=None, 2025-05-07T20:32:37.4869688Z contiguous=False, 2025-05-07T20:32:37.4869769Z compiled=True, 2025-05-07T20:32:37.4869840Z ) 2025-05-07T20:32:37.4870062Z self = 2025-05-07T20:32:37.4870233Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4870238Z 2025-05-07T20:32:37.4870320Z @given( 2025-05-07T20:32:37.4870482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4870586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4870708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4870831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4870944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4871020Z ) 2025-05-07T20:32:37.4871266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4871360Z def test_silu_mul_quant( 2025-05-07T20:32:37.4871438Z self, 2025-05-07T20:32:37.4871512Z T: int, 2025-05-07T20:32:37.4871587Z D: int, 2025-05-07T20:32:37.4871690Z scale_ub: Optional[float], 2025-05-07T20:32:37.4871778Z contiguous: bool, 2025-05-07T20:32:37.4871864Z compiled: bool, 2025-05-07T20:32:37.4871941Z ) -> None: 2025-05-07T20:32:37.4872050Z torch.manual_seed(2025) 2025-05-07T20:32:37.4872161Z 2025-05-07T20:32:37.4872393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4872468Z 2025-05-07T20:32:37.4872565Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4872688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4872827Z x = x_sign * x_clamp 2025-05-07T20:32:37.4872912Z x0 = x[:, :D] 2025-05-07T20:32:37.4872988Z x1 = x[:, D:] 2025-05-07T20:32:37.4873056Z 2025-05-07T20:32:37.4873140Z if contiguous: 2025-05-07T20:32:37.4873271Z x0 = x0.contiguous() 2025-05-07T20:32:37.4873360Z x1 = x1.contiguous() 2025-05-07T20:32:37.4873428Z 2025-05-07T20:32:37.4873516Z if scale_ub is not None: 2025-05-07T20:32:37.4873625Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4873761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4873835Z ) 2025-05-07T20:32:37.4873911Z else: 2025-05-07T20:32:37.4874006Z scale_ub_tensor = None 2025-05-07T20:32:37.4874076Z 2025-05-07T20:32:37.4874211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4874340Z op = silu_mul_quant 2025-05-07T20:32:37.4874421Z if compiled: 2025-05-07T20:32:37.4874524Z op = torch.compile(op) 2025-05-07T20:32:37.4874631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4874705Z 2025-05-07T20:32:37.4874794Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.4874918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.4874990Z 2025-05-07T20:32:37.4875127Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4875227Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.4875327Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.4875448Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.4875590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4875668Z 2025-05-07T20:32:37.4875766Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.4875771Z 2025-05-07T20:32:37.4875872Z moe/activation_test.py:126: 2025-05-07T20:32:37.4876001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4876108Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.4876247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.4876812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.4876913Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.4877280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4877505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4877945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.4878202Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4878610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:37.4878866Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.4879246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.4879417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.4879760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.4879834Z fn() 2025-05-07T20:32:37.4880240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.4880319Z self.fn.run( 2025-05-07T20:32:37.4880660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4880759Z kernel = self.compile( 2025-05-07T20:32:37.4881183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4881361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4881525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4881529Z 2025-05-07T20:32:37.4881736Z self = 2025-05-07T20:32:37.4882527Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4883036Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa7adcaadd0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
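Note that in the example just above, the reference path fails too: ref_fn calls triton_quantize_fp8_row from fbgemm_gpu's fp8_gemm module, which launches its own Triton kernel (_kernel_quantize_fp8_row), so both the op under test and the reference hit the identical fp8e4nv compile error. For clarity, here is a minimal eager-mode sketch of the per-row FP8 quantization the reference computes, assuming row-wise E4M3 scaling with a maximum representable magnitude of 448 and an epsilon guard; the function name and these details are assumptions for illustration, and FBGEMM's triton_quantize_fp8_row may differ.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # assumption: max magnitude representable in float8_e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Eager-mode sketch of per-row FP8 quantization (not FBGEMM's kernel)."""
    row_max = y.abs().amax(dim=-1).float()          # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())  # cap the scale
    row_max = torch.clamp(row_max, min=1e-12)       # avoid division by zero
    y_scale = row_max / FP8_E4M3_MAX                # per-row dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Dequantization is then y_fp8.to(torch.float32) * y_scale[:, None], which is exactly the comparison the test performs on both outputs.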
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
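As the parameter sweep shows, the failure is independent of T, D, scale_ub, contiguous, and compiled: any Triton kernel that casts to tl.float8e4nv trips the same backend check on this GPU. A hypothetical minimal repro, independent of FBGEMM, could look like the sketch below (the kernel and its names are illustrative only):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv is what the pre-sm_89 backend rejects.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (sm_86) this launch raises the same CompilationError seen above.
_cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)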
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4981452Z 2025-05-07T20:32:37.4981872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4981879Z 2025-05-07T20:32:37.4981980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4982201Z self=, 2025-05-07T20:32:37.4982276Z T=4096, 2025-05-07T20:32:37.4982347Z D=5120, 2025-05-07T20:32:37.4982424Z scale_ub=None, 2025-05-07T20:32:37.4982510Z contiguous=False, 2025-05-07T20:32:37.4982589Z compiled=True, 2025-05-07T20:32:37.4982662Z ) 2025-05-07T20:32:37.4982881Z self = 2025-05-07T20:32:37.4983052Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.4983060Z 2025-05-07T20:32:37.4983133Z @given( 2025-05-07T20:32:37.4983292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4983389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4983508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4983625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4983775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4983850Z ) 2025-05-07T20:32:37.4984096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4984191Z def test_silu_mul_quant( 2025-05-07T20:32:37.4984267Z self, 2025-05-07T20:32:37.4984341Z T: int, 2025-05-07T20:32:37.4984414Z D: int, 2025-05-07T20:32:37.4984511Z scale_ub: Optional[float], 2025-05-07T20:32:37.4984602Z contiguous: bool, 2025-05-07T20:32:37.4984685Z compiled: bool, 2025-05-07T20:32:37.4984827Z ) -> None: 2025-05-07T20:32:37.4984922Z torch.manual_seed(2025) 2025-05-07T20:32:37.4984989Z 2025-05-07T20:32:37.4985159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4985230Z 2025-05-07T20:32:37.4985318Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4985442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4985531Z x = x_sign * x_clamp 2025-05-07T20:32:37.4985606Z x0 = x[:, :D] 2025-05-07T20:32:37.4985682Z x1 = x[:, D:] 2025-05-07T20:32:37.4985753Z 2025-05-07T20:32:37.4985834Z if contiguous: 2025-05-07T20:32:37.4985925Z x0 = x0.contiguous() 2025-05-07T20:32:37.4986010Z x1 = x1.contiguous() 2025-05-07T20:32:37.4986077Z 2025-05-07T20:32:37.4986166Z if scale_ub is not None: 2025-05-07T20:32:37.4986271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4986403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4986481Z ) 2025-05-07T20:32:37.4986552Z else: 2025-05-07T20:32:37.4986646Z scale_ub_tensor = None 2025-05-07T20:32:37.4986719Z 2025-05-07T20:32:37.4986848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4986936Z op = silu_mul_quant 2025-05-07T20:32:37.4987020Z if compiled: 2025-05-07T20:32:37.4987119Z op = torch.compile(op) 2025-05-07T20:32:37.4987225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4987292Z 2025-05-07T20:32:37.4987380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.4987384Z 2025-05-07T20:32:37.4987481Z moe/activation_test.py:117: 2025-05-07T20:32:37.4987604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4987749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.4987851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.4988221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.4988315Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.4988818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.4988912Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.4989279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.4989503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.4989845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.4989940Z kernel = self.compile( 2025-05-07T20:32:37.4990327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.4990502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4990626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.4990671Z 2025-05-07T20:32:37.4990878Z self = 2025-05-07T20:32:37.4991665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.4992214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768280>} 2025-05-07T20:32:37.4992973Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.4993165Z context = 2025-05-07T20:32:37.4993208Z 2025-05-07T20:32:37.4993374Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.4993642Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4993749Z module_map=module_map) 2025-05-07T20:32:37.4993915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4994011Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4994084Z E ^ 2025-05-07T20:32:37.4994442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4994446Z 2025-05-07T20:32:37.4994864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4994868Z 2025-05-07T20:32:37.4994974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.4995197Z self=, 2025-05-07T20:32:37.4995271Z T=4096, 2025-05-07T20:32:37.4995345Z D=5120, 2025-05-07T20:32:37.4995424Z scale_ub=1200.0, 2025-05-07T20:32:37.4995507Z contiguous=False, 2025-05-07T20:32:37.4995591Z compiled=False, 2025-05-07T20:32:37.4995663Z ) 2025-05-07T20:32:37.4995879Z self = 2025-05-07T20:32:37.4996055Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.4996059Z 2025-05-07T20:32:37.4996135Z @given( 2025-05-07T20:32:37.4996253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.4996350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.4996506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.4996625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.4996735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.4996807Z ) 2025-05-07T20:32:37.4997058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.4997151Z def test_silu_mul_quant( 2025-05-07T20:32:37.4997223Z self, 2025-05-07T20:32:37.4997298Z T: int, 2025-05-07T20:32:37.4997369Z D: int, 2025-05-07T20:32:37.4997467Z scale_ub: Optional[float], 2025-05-07T20:32:37.4997555Z contiguous: bool, 2025-05-07T20:32:37.4997637Z compiled: bool, 2025-05-07T20:32:37.4997717Z ) -> None: 2025-05-07T20:32:37.4997808Z torch.manual_seed(2025) 2025-05-07T20:32:37.4997876Z 2025-05-07T20:32:37.4998047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.4998118Z 2025-05-07T20:32:37.4998207Z x_sign = torch.sign(x) 2025-05-07T20:32:37.4998334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.4998419Z x = x_sign * x_clamp 2025-05-07T20:32:37.4998499Z x0 = x[:, :D] 2025-05-07T20:32:37.4998576Z x1 = x[:, D:] 2025-05-07T20:32:37.4998643Z 2025-05-07T20:32:37.4998765Z if contiguous: 2025-05-07T20:32:37.4998859Z x0 = x0.contiguous() 2025-05-07T20:32:37.4998945Z x1 = x1.contiguous() 2025-05-07T20:32:37.4999017Z 2025-05-07T20:32:37.4999106Z if scale_ub is not None: 2025-05-07T20:32:37.4999249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.4999385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.4999455Z ) 2025-05-07T20:32:37.4999527Z else: 2025-05-07T20:32:37.4999620Z scale_ub_tensor = None 2025-05-07T20:32:37.4999689Z 2025-05-07T20:32:37.4999817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.4999912Z op = silu_mul_quant 2025-05-07T20:32:37.4999992Z if compiled: 2025-05-07T20:32:37.5000086Z op = torch.compile(op) 2025-05-07T20:32:37.5000238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5000307Z 2025-05-07T20:32:37.5000400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5000405Z 2025-05-07T20:32:37.5000498Z moe/activation_test.py:117: 2025-05-07T20:32:37.5000621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5000724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5000823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5001327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5001424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5001782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5002010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5002353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5002447Z kernel = self.compile( 2025-05-07T20:32:37.5002835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5003008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5003131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5003140Z 2025-05-07T20:32:37.5003344Z self = 2025-05-07T20:32:37.5004169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5004681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769000>} 2025-05-07T20:32:37.5005439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5005632Z context = 2025-05-07T20:32:37.5005639Z 2025-05-07T20:32:37.5005803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5006067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5006174Z module_map=module_map) 2025-05-07T20:32:37.5006333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5006430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5006507Z E ^ 2025-05-07T20:32:37.5006862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5006869Z 2025-05-07T20:32:37.5007330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5007335Z 2025-05-07T20:32:37.5007437Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5007696Z self=, 2025-05-07T20:32:37.5007771Z T=4096, 2025-05-07T20:32:37.5007844Z D=5120, 2025-05-07T20:32:37.5007926Z scale_ub=1200.0, 2025-05-07T20:32:37.5008009Z contiguous=False, 2025-05-07T20:32:37.5008089Z compiled=True, 2025-05-07T20:32:37.5008159Z ) 2025-05-07T20:32:37.5008378Z self = 2025-05-07T20:32:37.5008554Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5008559Z 2025-05-07T20:32:37.5008633Z @given( 2025-05-07T20:32:37.5008748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5008888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5009006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5009122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5009235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5009309Z ) 2025-05-07T20:32:37.5009552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5009648Z def test_silu_mul_quant( 2025-05-07T20:32:37.5009722Z self, 2025-05-07T20:32:37.5009796Z T: int, 2025-05-07T20:32:37.5009873Z D: int, 2025-05-07T20:32:37.5009969Z scale_ub: Optional[float], 2025-05-07T20:32:37.5010055Z contiguous: bool, 2025-05-07T20:32:37.5010144Z compiled: bool, 2025-05-07T20:32:37.5010218Z ) -> None: 2025-05-07T20:32:37.5010310Z torch.manual_seed(2025) 2025-05-07T20:32:37.5010381Z 2025-05-07T20:32:37.5010549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5010623Z 2025-05-07T20:32:37.5010715Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5010836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5010924Z x = x_sign * x_clamp 2025-05-07T20:32:37.5010999Z x0 = x[:, :D] 2025-05-07T20:32:37.5011076Z x1 = x[:, D:] 2025-05-07T20:32:37.5011146Z 2025-05-07T20:32:37.5011227Z if contiguous: 2025-05-07T20:32:37.5011316Z x0 = x0.contiguous() 2025-05-07T20:32:37.5011403Z x1 = x1.contiguous() 2025-05-07T20:32:37.5011473Z 2025-05-07T20:32:37.5011562Z if scale_ub is not None: 2025-05-07T20:32:37.5011667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5011842Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5011915Z ) 2025-05-07T20:32:37.5011990Z else: 2025-05-07T20:32:37.5012080Z scale_ub_tensor = None 2025-05-07T20:32:37.5012156Z 2025-05-07T20:32:37.5012287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5012373Z op = silu_mul_quant 2025-05-07T20:32:37.5012457Z if compiled: 2025-05-07T20:32:37.5012552Z op = torch.compile(op) 2025-05-07T20:32:37.5012655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5012731Z 2025-05-07T20:32:37.5012818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5012822Z 2025-05-07T20:32:37.5012916Z moe/activation_test.py:117: 2025-05-07T20:32:37.5013042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5013140Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5013240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5013611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5013702Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5014250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5014346Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5014703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5014992Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5015330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5015425Z kernel = self.compile( 2025-05-07T20:32:37.5015809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5015986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5016109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5016154Z 2025-05-07T20:32:37.5016359Z self = 2025-05-07T20:32:37.5017147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5017657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad768700>} 2025-05-07T20:32:37.5018496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5018691Z context = 2025-05-07T20:32:37.5018695Z 2025-05-07T20:32:37.5018858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5019129Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5019233Z module_map=module_map) 2025-05-07T20:32:37.5019393Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5019496Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5019569Z E ^ 2025-05-07T20:32:37.5019929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5019934Z 2025-05-07T20:32:37.5020350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5020354Z 2025-05-07T20:32:37.5020499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5020725Z self=, 2025-05-07T20:32:37.5020800Z T=2048, 2025-05-07T20:32:37.5020873Z D=7168, 2025-05-07T20:32:37.5020959Z scale_ub=1200.0, 2025-05-07T20:32:37.5021046Z contiguous=False, 2025-05-07T20:32:37.5021132Z compiled=False, 2025-05-07T20:32:37.5021200Z ) 2025-05-07T20:32:37.5021416Z self = 2025-05-07T20:32:37.5021597Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.5021602Z 2025-05-07T20:32:37.5021674Z @given( 2025-05-07T20:32:37.5021790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5021889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5022002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5022116Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5022234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5022303Z ) 2025-05-07T20:32:37.5022549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5022641Z def test_silu_mul_quant( 2025-05-07T20:32:37.5022713Z self, 2025-05-07T20:32:37.5022833Z T: int, 2025-05-07T20:32:37.5022908Z D: int, 2025-05-07T20:32:37.5023003Z scale_ub: Optional[float], 2025-05-07T20:32:37.5023092Z contiguous: bool, 2025-05-07T20:32:37.5023214Z compiled: bool, 2025-05-07T20:32:37.5023288Z ) -> None: 2025-05-07T20:32:37.5023385Z torch.manual_seed(2025) 2025-05-07T20:32:37.5023456Z 2025-05-07T20:32:37.5023622Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5023697Z 2025-05-07T20:32:37.5023785Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5023910Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5023998Z x = x_sign * x_clamp 2025-05-07T20:32:37.5024073Z x0 = x[:, :D] 2025-05-07T20:32:37.5024151Z x1 = x[:, D:] 2025-05-07T20:32:37.5024221Z 2025-05-07T20:32:37.5024342Z if contiguous: 2025-05-07T20:32:37.5024434Z x0 = x0.contiguous() 2025-05-07T20:32:37.5024521Z x1 = x1.contiguous() 2025-05-07T20:32:37.5024588Z 2025-05-07T20:32:37.5024678Z if scale_ub is not None: 2025-05-07T20:32:37.5024781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5024915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5024989Z ) 2025-05-07T20:32:37.5025060Z else: 2025-05-07T20:32:37.5025150Z scale_ub_tensor = None 2025-05-07T20:32:37.5025225Z 2025-05-07T20:32:37.5025354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5025444Z op = silu_mul_quant 2025-05-07T20:32:37.5025524Z if compiled: 2025-05-07T20:32:37.5025622Z op = torch.compile(op) 2025-05-07T20:32:37.5025730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5025799Z 2025-05-07T20:32:37.5025886Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5025893Z 2025-05-07T20:32:37.5025990Z moe/activation_test.py:117: 2025-05-07T20:32:37.5026116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5026213Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5026314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5026821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5026918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5027277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5027498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5027888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5027982Z kernel = self.compile( 2025-05-07T20:32:37.5028375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5028547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5028667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5028674Z 2025-05-07T20:32:37.5028881Z self = 2025-05-07T20:32:37.5029666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5030180Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad769240>} 2025-05-07T20:32:37.5030973Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5031167Z context = 2025-05-07T20:32:37.5031172Z 2025-05-07T20:32:37.5031338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5031641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5031750Z module_map=module_map) 2025-05-07T20:32:37.5031910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5032006Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5032084Z E ^ 2025-05-07T20:32:37.5032442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5032447Z 2025-05-07T20:32:37.5032904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5032913Z 2025-05-07T20:32:37.5033017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5033237Z self=, 2025-05-07T20:32:37.5033315Z T=1, 2025-05-07T20:32:37.5033388Z D=7168, 2025-05-07T20:32:37.5033465Z scale_ub=None, 2025-05-07T20:32:37.5033553Z contiguous=True, 2025-05-07T20:32:37.5033633Z compiled=False, 2025-05-07T20:32:37.5033701Z ) 2025-05-07T20:32:37.5033921Z self = 2025-05-07T20:32:37.5034081Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5034086Z 2025-05-07T20:32:37.5034163Z @given( 2025-05-07T20:32:37.5034281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5034377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5034495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5034611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5034722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5034793Z ) 2025-05-07T20:32:37.5035037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5035130Z def test_silu_mul_quant( 2025-05-07T20:32:37.5035205Z self, 2025-05-07T20:32:37.5035277Z T: int, 2025-05-07T20:32:37.5035349Z D: int, 2025-05-07T20:32:37.5035446Z scale_ub: Optional[float], 2025-05-07T20:32:37.5035532Z contiguous: bool, 2025-05-07T20:32:37.5035616Z compiled: bool, 2025-05-07T20:32:37.5035689Z ) -> None: 2025-05-07T20:32:37.5035781Z torch.manual_seed(2025) 2025-05-07T20:32:37.5035897Z 2025-05-07T20:32:37.5036064Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5036134Z 2025-05-07T20:32:37.5036228Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5036349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5036435Z x = x_sign * x_clamp 2025-05-07T20:32:37.5036513Z x0 = x[:, :D] 2025-05-07T20:32:37.5036588Z x1 = x[:, D:] 2025-05-07T20:32:37.5036657Z 2025-05-07T20:32:37.5036740Z if contiguous: 2025-05-07T20:32:37.5036831Z x0 = x0.contiguous() 2025-05-07T20:32:37.5036919Z x1 = x1.contiguous() 2025-05-07T20:32:37.5036986Z 2025-05-07T20:32:37.5037073Z if scale_ub is not None: 2025-05-07T20:32:37.5037177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5037309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5037382Z ) 2025-05-07T20:32:37.5037456Z else: 2025-05-07T20:32:37.5037549Z scale_ub_tensor = None 2025-05-07T20:32:37.5037617Z 2025-05-07T20:32:37.5037747Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5037836Z op = silu_mul_quant 2025-05-07T20:32:37.5037917Z if compiled: 2025-05-07T20:32:37.5038057Z op = torch.compile(op) 2025-05-07T20:32:37.5038161Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5038232Z 2025-05-07T20:32:37.5038319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5038361Z 2025-05-07T20:32:37.5038456Z moe/activation_test.py:117: 2025-05-07T20:32:37.5038607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5038717Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5038825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5039332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5039428Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5039786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5040053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5040398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5040489Z kernel = self.compile( 2025-05-07T20:32:37.5040875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5041048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5041170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5041174Z 2025-05-07T20:32:37.5041378Z self = 2025-05-07T20:32:37.5045556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5046096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76a050>} 2025-05-07T20:32:37.5046856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5047058Z context = 2025-05-07T20:32:37.5047064Z 2025-05-07T20:32:37.5047231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5047561Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5047668Z module_map=module_map) 2025-05-07T20:32:37.5047836Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5047934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5048005Z E ^ 2025-05-07T20:32:37.5048393Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5048398Z 2025-05-07T20:32:37.5048841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5048847Z 2025-05-07T20:32:37.5048952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5049173Z self=, 2025-05-07T20:32:37.5049244Z T=16384, 2025-05-07T20:32:37.5049321Z D=7168, 2025-05-07T20:32:37.5049400Z scale_ub=1200.0, 2025-05-07T20:32:37.5049483Z contiguous=False, 2025-05-07T20:32:37.5049569Z compiled=True, 2025-05-07T20:32:37.5049637Z ) 2025-05-07T20:32:37.5049853Z self = 2025-05-07T20:32:37.5050040Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5050114Z 2025-05-07T20:32:37.5050187Z @given( 2025-05-07T20:32:37.5050306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5050402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5050557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5050678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5050789Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5050858Z ) 2025-05-07T20:32:37.5051109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5051200Z def test_silu_mul_quant( 2025-05-07T20:32:37.5051271Z self, 2025-05-07T20:32:37.5051348Z T: int, 2025-05-07T20:32:37.5051419Z D: int, 2025-05-07T20:32:37.5051517Z scale_ub: Optional[float], 2025-05-07T20:32:37.5051607Z contiguous: bool, 2025-05-07T20:32:37.5051731Z compiled: bool, 2025-05-07T20:32:37.5051808Z ) -> None: 2025-05-07T20:32:37.5051904Z torch.manual_seed(2025) 2025-05-07T20:32:37.5051971Z 2025-05-07T20:32:37.5052143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5052213Z 2025-05-07T20:32:37.5052304Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5052431Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5052517Z x = x_sign * x_clamp 2025-05-07T20:32:37.5052592Z x0 = x[:, :D] 2025-05-07T20:32:37.5052673Z x1 = x[:, D:] 2025-05-07T20:32:37.5052742Z 2025-05-07T20:32:37.5052822Z if contiguous: 2025-05-07T20:32:37.5052915Z x0 = x0.contiguous() 2025-05-07T20:32:37.5053001Z x1 = x1.contiguous() 2025-05-07T20:32:37.5053074Z 2025-05-07T20:32:37.5053164Z if scale_ub is not None: 2025-05-07T20:32:37.5053268Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5053407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5053479Z ) 2025-05-07T20:32:37.5053558Z else: 2025-05-07T20:32:37.5053652Z scale_ub_tensor = None 2025-05-07T20:32:37.5053724Z 2025-05-07T20:32:37.5053853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5053946Z op = silu_mul_quant 2025-05-07T20:32:37.5054026Z if compiled: 2025-05-07T20:32:37.5054121Z op = torch.compile(op) 2025-05-07T20:32:37.5054226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5054296Z 2025-05-07T20:32:37.5054390Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5054395Z 2025-05-07T20:32:37.5054490Z moe/activation_test.py:117: 2025-05-07T20:32:37.5054661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5054766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5054863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5055238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5055333Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5056114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5056223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5056585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5056807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5057153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5057249Z kernel = self.compile( 2025-05-07T20:32:37.5057635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5057818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5058097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5058104Z 2025-05-07T20:32:37.5058332Z self = 2025-05-07T20:32:37.5059298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5059809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b490>} 2025-05-07T20:32:37.5060576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5060833Z context = 2025-05-07T20:32:37.5060838Z 2025-05-07T20:32:37.5061007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5061274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5061395Z module_map=module_map) 2025-05-07T20:32:37.5061558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5061654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5061733Z E ^ 2025-05-07T20:32:37.5062091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5062096Z 2025-05-07T20:32:37.5062517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5062524Z 2025-05-07T20:32:37.5062629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5062852Z self=, 2025-05-07T20:32:37.5062928Z T=1, 2025-05-07T20:32:37.5062999Z D=7168, 2025-05-07T20:32:37.5063079Z scale_ub=None, 2025-05-07T20:32:37.5063167Z contiguous=False, 2025-05-07T20:32:37.5063251Z compiled=False, 2025-05-07T20:32:37.5063318Z ) 2025-05-07T20:32:37.5063538Z self = 2025-05-07T20:32:37.5063708Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.5063713Z 2025-05-07T20:32:37.5063786Z @given( 2025-05-07T20:32:37.5063904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5064064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5064183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5064299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5064420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5064501Z ) 2025-05-07T20:32:37.5064748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5064840Z def test_silu_mul_quant( 2025-05-07T20:32:37.5064913Z self, 2025-05-07T20:32:37.5064993Z T: int, 2025-05-07T20:32:37.5065066Z D: int, 2025-05-07T20:32:37.5065161Z scale_ub: Optional[float], 2025-05-07T20:32:37.5065251Z contiguous: bool, 2025-05-07T20:32:37.5065333Z compiled: bool, 2025-05-07T20:32:37.5065410Z ) -> None: 2025-05-07T20:32:37.5065506Z torch.manual_seed(2025) 2025-05-07T20:32:37.5065573Z 2025-05-07T20:32:37.5065745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5065817Z 2025-05-07T20:32:37.5065905Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5066030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5066118Z x = x_sign * x_clamp 2025-05-07T20:32:37.5066194Z x0 = x[:, :D] 2025-05-07T20:32:37.5066318Z x1 = x[:, D:] 2025-05-07T20:32:37.5066387Z 2025-05-07T20:32:37.5066466Z if contiguous: 2025-05-07T20:32:37.5066555Z x0 = x0.contiguous() 2025-05-07T20:32:37.5066639Z x1 = x1.contiguous() 2025-05-07T20:32:37.5066751Z 2025-05-07T20:32:37.5066842Z if scale_ub is not None: 2025-05-07T20:32:37.5066946Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5067082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5067154Z ) 2025-05-07T20:32:37.5067225Z else: 2025-05-07T20:32:37.5067323Z scale_ub_tensor = None 2025-05-07T20:32:37.5067390Z 2025-05-07T20:32:37.5067521Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5067611Z op = silu_mul_quant 2025-05-07T20:32:37.5067690Z if compiled: 2025-05-07T20:32:37.5067833Z op = torch.compile(op) 2025-05-07T20:32:37.5067940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5068011Z 2025-05-07T20:32:37.5068097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5068102Z 2025-05-07T20:32:37.5068203Z moe/activation_test.py:117: 2025-05-07T20:32:37.5068327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5068436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5068552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5069090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5069187Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5069550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5069835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5070233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5070325Z kernel = self.compile( 2025-05-07T20:32:37.5070712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5070889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5071010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5071014Z 2025-05-07T20:32:37.5071222Z self = 2025-05-07T20:32:37.5072060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5072576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad76b7f0>} 2025-05-07T20:32:37.5073330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5073528Z context = 2025-05-07T20:32:37.5073533Z 2025-05-07T20:32:37.5073695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5073960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5074069Z module_map=module_map) 2025-05-07T20:32:37.5074230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5074327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5074405Z E ^ 2025-05-07T20:32:37.5074803Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5074808Z 2025-05-07T20:32:37.5075227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5075232Z 2025-05-07T20:32:37.5075374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5075596Z self=, 2025-05-07T20:32:37.5075671Z T=2048, 2025-05-07T20:32:37.5075743Z D=7168, 2025-05-07T20:32:37.5075821Z scale_ub=None, 2025-05-07T20:32:37.5075908Z contiguous=False, 2025-05-07T20:32:37.5075988Z compiled=True, 2025-05-07T20:32:37.5076058Z ) 2025-05-07T20:32:37.5076277Z self = 2025-05-07T20:32:37.5076450Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5076497Z 2025-05-07T20:32:37.5076574Z @given( 2025-05-07T20:32:37.5076690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5076788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5076903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5077018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5077131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5077205Z ) 2025-05-07T20:32:37.5077448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5077543Z def test_silu_mul_quant( 2025-05-07T20:32:37.5077614Z self, 2025-05-07T20:32:37.5077686Z T: int, 2025-05-07T20:32:37.5077761Z D: int, 2025-05-07T20:32:37.5077855Z scale_ub: Optional[float], 2025-05-07T20:32:37.5077944Z contiguous: bool, 2025-05-07T20:32:37.5078030Z compiled: bool, 2025-05-07T20:32:37.5078104Z ) -> None: 2025-05-07T20:32:37.5078201Z torch.manual_seed(2025) 2025-05-07T20:32:37.5078275Z 2025-05-07T20:32:37.5078445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5078515Z 2025-05-07T20:32:37.5078608Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5078729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5078820Z x = x_sign * x_clamp 2025-05-07T20:32:37.5078897Z x0 = x[:, :D] 2025-05-07T20:32:37.5078974Z x1 = x[:, D:] 2025-05-07T20:32:37.5079045Z 2025-05-07T20:32:37.5079123Z if contiguous: 2025-05-07T20:32:37.5079210Z x0 = x0.contiguous() 2025-05-07T20:32:37.5079296Z x1 = x1.contiguous() 2025-05-07T20:32:37.5079363Z 2025-05-07T20:32:37.5079451Z if scale_ub is not None: 2025-05-07T20:32:37.5079627Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5079761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5079831Z ) 2025-05-07T20:32:37.5079911Z else: 2025-05-07T20:32:37.5080003Z scale_ub_tensor = None 2025-05-07T20:32:37.5080072Z 2025-05-07T20:32:37.5080204Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5080291Z op = silu_mul_quant 2025-05-07T20:32:37.5080374Z if compiled: 2025-05-07T20:32:37.5080472Z op = torch.compile(op) 2025-05-07T20:32:37.5080575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5080645Z 2025-05-07T20:32:37.5080732Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5080736Z 2025-05-07T20:32:37.5080829Z moe/activation_test.py:117: 2025-05-07T20:32:37.5080957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5081054Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5081154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5081527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5081621Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5082166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5082262Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5082621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5082885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5083227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5083321Z kernel = self.compile( 2025-05-07T20:32:37.5083708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5083883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5084050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5084054Z 2025-05-07T20:32:37.5084262Z self = 2025-05-07T20:32:37.5085048Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5085560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24caf0>} 2025-05-07T20:32:37.5086318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5086514Z context = 2025-05-07T20:32:37.5086521Z 2025-05-07T20:32:37.5086687Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5086954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5087058Z module_map=module_map) 2025-05-07T20:32:37.5087221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5087319Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5087390Z E ^ 2025-05-07T20:32:37.5087747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5087755Z 2025-05-07T20:32:37.5088213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5088218Z 2025-05-07T20:32:37.5088321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5088545Z self=, 2025-05-07T20:32:37.5088619Z T=4096, 2025-05-07T20:32:37.5088691Z D=7168, 2025-05-07T20:32:37.5088775Z scale_ub=None, 2025-05-07T20:32:37.5088857Z contiguous=False, 2025-05-07T20:32:37.5088936Z compiled=True, 2025-05-07T20:32:37.5089007Z ) 2025-05-07T20:32:37.5089224Z self = 2025-05-07T20:32:37.5089400Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5089405Z 2025-05-07T20:32:37.5089477Z @given( 2025-05-07T20:32:37.5089592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5089691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5089804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5089922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5090037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5090105Z ) 2025-05-07T20:32:37.5090351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5090489Z def test_silu_mul_quant( 2025-05-07T20:32:37.5090562Z self, 2025-05-07T20:32:37.5090637Z T: int, 2025-05-07T20:32:37.5090708Z D: int, 2025-05-07T20:32:37.5090803Z scale_ub: Optional[float], 2025-05-07T20:32:37.5090932Z contiguous: bool, 2025-05-07T20:32:37.5091015Z compiled: bool, 2025-05-07T20:32:37.5091088Z ) -> None: 2025-05-07T20:32:37.5091182Z torch.manual_seed(2025) 2025-05-07T20:32:37.5091249Z 2025-05-07T20:32:37.5091416Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5091487Z 2025-05-07T20:32:37.5091574Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5091699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5091788Z x = x_sign * x_clamp 2025-05-07T20:32:37.5091864Z x0 = x[:, :D] 2025-05-07T20:32:37.5091986Z x1 = x[:, D:] 2025-05-07T20:32:37.5092054Z 2025-05-07T20:32:37.5092131Z if contiguous: 2025-05-07T20:32:37.5092224Z x0 = x0.contiguous() 2025-05-07T20:32:37.5092309Z x1 = x1.contiguous() 2025-05-07T20:32:37.5092375Z 2025-05-07T20:32:37.5092466Z if scale_ub is not None: 2025-05-07T20:32:37.5092568Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5092704Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5092777Z ) 2025-05-07T20:32:37.5092852Z else: 2025-05-07T20:32:37.5092943Z scale_ub_tensor = None 2025-05-07T20:32:37.5093017Z 2025-05-07T20:32:37.5093146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5093231Z op = silu_mul_quant 2025-05-07T20:32:37.5093317Z if compiled: 2025-05-07T20:32:37.5093412Z op = torch.compile(op) 2025-05-07T20:32:37.5093519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5093589Z 2025-05-07T20:32:37.5093675Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5093680Z 2025-05-07T20:32:37.5093779Z moe/activation_test.py:117: 2025-05-07T20:32:37.5093902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5093999Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5094101Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5094474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5094565Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5095065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.5095158Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.5095567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5095793Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5096137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5096233Z     kernel = self.compile(
2025-05-07T20:32:37.5096618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5096799Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5096920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5096925Z 
2025-05-07T20:32:37.5097131Z self = 
2025-05-07T20:32:37.5097921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5098582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24c280>}
2025-05-07T20:32:37.5099363Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.5099593Z context = 
2025-05-07T20:32:37.5099598Z 
2025-05-07T20:32:37.5099766Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5100031Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5100142Z                            module_map=module_map)
2025-05-07T20:32:37.5100304Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5100400Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.5100514Z E   ^
2025-05-07T20:32:37.5100881Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5100886Z 
2025-05-07T20:32:37.5101303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.5101310Z 
2025-05-07T20:32:37.5101416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.5101637Z     self=,
2025-05-07T20:32:37.5101709Z     T=16384,
2025-05-07T20:32:37.5101783Z     D=5120,
2025-05-07T20:32:37.5101863Z     scale_ub=1200.0,
2025-05-07T20:32:37.5101944Z     contiguous=False,
2025-05-07T20:32:37.5102027Z     compiled=False,
2025-05-07T20:32:37.5102093Z )
2025-05-07T20:32:37.5102313Z self = 
2025-05-07T20:32:37.5102496Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:37.5102504Z 
2025-05-07T20:32:37.5102578Z     @given(
2025-05-07T20:32:37.5102697Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.5102794Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.5102906Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.5103026Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.5103136Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.5103205Z     )
2025-05-07T20:32:37.5103454Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.5103544Z     def test_silu_mul_quant(
2025-05-07T20:32:37.5103615Z         self,
2025-05-07T20:32:37.5103688Z         T: int,
2025-05-07T20:32:37.5103758Z         D: int,
2025-05-07T20:32:37.5103901Z         scale_ub: Optional[float],
2025-05-07T20:32:37.5103988Z         contiguous: bool,
2025-05-07T20:32:37.5104068Z         compiled: bool,
2025-05-07T20:32:37.5104147Z     ) -> None:
2025-05-07T20:32:37.5104238Z         torch.manual_seed(2025)
2025-05-07T20:32:37.5104304Z 
2025-05-07T20:32:37.5104480Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.5104551Z 
2025-05-07T20:32:37.5104639Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.5104766Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5104852Z         x = x_sign * x_clamp
2025-05-07T20:32:37.5104926Z         x0 = x[:, :D]
2025-05-07T20:32:37.5105008Z         x1 = x[:, D:]
2025-05-07T20:32:37.5105074Z 
2025-05-07T20:32:37.5105158Z         if contiguous:
2025-05-07T20:32:37.5105245Z             x0 = x0.contiguous()
2025-05-07T20:32:37.5105329Z             x1 = x1.contiguous()
2025-05-07T20:32:37.5105400Z 
2025-05-07T20:32:37.5105491Z         if scale_ub is not None:
2025-05-07T20:32:37.5105592Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.5105725Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.5105799Z             )
2025-05-07T20:32:37.5105876Z         else:
2025-05-07T20:32:37.5106011Z             scale_ub_tensor = None
2025-05-07T20:32:37.5106081Z 
2025-05-07T20:32:37.5106211Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.5106296Z             op = silu_mul_quant
2025-05-07T20:32:37.5106417Z             if compiled:
2025-05-07T20:32:37.5106518Z                 op = torch.compile(op)
2025-05-07T20:32:37.5106621Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5106689Z 
2025-05-07T20:32:37.5106781Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.5106785Z 
2025-05-07T20:32:37.5106880Z moe/activation_test.py:117: 
2025-05-07T20:32:37.5107008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5107109Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.5107206Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.5107713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.5107876Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.5108236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.5108477Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.5108859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.5108954Z     kernel = self.compile(
2025-05-07T20:32:37.5109338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.5109513Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.5109636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.5109641Z 
2025-05-07T20:32:37.5109847Z self = 
2025-05-07T20:32:37.5110636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.5111146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ad24ed40>}
2025-05-07T20:32:37.5111900Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.5112137Z context = 
2025-05-07T20:32:37.5112142Z 
2025-05-07T20:32:37.5112307Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.5112578Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.5112684Z                            module_map=module_map)
2025-05-07T20:32:37.5112844Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.5112943Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.5113019Z E   ^
2025-05-07T20:32:37.5113376Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.5113385Z 
2025-05-07T20:32:37.5113800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
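Why every example fails: Triton's fp8e4nv is the NVIDIA e4m3 float8 format (the one torch.float8_e4m3fn maps to), and recent Triton releases only emit it for GPUs of compute capability 8.9 (Ada) or newer. The linux.g5.4xlarge runner class used by this job carries an NVIDIA A10G, which is SM 8.6, so the CUDA backend offers only the 'fp8e4b15' and 'fp8e5' (e5m2) formats named in the error. A minimal standalone sketch that reproduces the same CompilationError on such a GPU, assuming only that triton and torch are installed and a CUDA device is visible (this is not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offsets = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        # This cast is what trips pre-SM89 backends: fp8e4nv is rejected
        # during ast_to_ttir, before any PTX is generated.
        tl.store(y_ptr + offsets, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # compiles (and fails) at first launch

On SM 8.9+ hardware the same launch completes and y holds the fp8 payload; on the A10G it raises exactly the ValueError seen above.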
Hypothesis went on to try further examples, and every one failed with the identical CompilationError from the same _fbgemm_silu_mul_quant launch: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated tracebacks, which differ only in object addresses and, for compiled=True, an extra torch/_dynamo/eval_frame.py:678 frame, are omitted; the distinct parameter sets were:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
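Note the two traceback shapes: with compiled=False the frame chain runs straight from activation_test.py into activation.py, while with compiled=True it first passes through torch/_dynamo/eval_frame.py:678. In both cases the error only appears at the call "y_fp8, y_scale = fn()" because compilation is deferred to the first invocation, for the Triton JIT kernel and for torch.compile alike. A small illustration of that laziness (generic PyTorch, not FBGEMM code):

    import torch

    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(a) * b

    op = torch.compile(silu_mul)   # nothing is compiled yet; op merely wraps silu_mul
    a, b = torch.randn(8, 16), torch.randn(8, 16)
    out = op(a, b)                 # first call: Dynamo intercepts the frame
                                   # (torch/_dynamo/eval_frame.py) and compiles it

So a compile-time problem such as the fp8e4nv rejection surfaces as a runtime failure at the call site, which is exactly where Hypothesis reports it.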
The run continued in the same pattern, each example failing with the identical CompilationError:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
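Since the failure is a property of the GPU rather than of any particular (T, D, scale_ub) combination, one way to keep such a job green on pre-Ada runners is to gate the FP8 tests on device capability. A hypothetical guard, shown below as a sketch: the helper name and the skipIf are illustrative, and activation_test.py as printed above contains no such check:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) needs compute capability >= 8.9 (Ada/Hopper) in Triton.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant would live here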
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5208788Z 2025-05-07T20:32:37.5209208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5209216Z 2025-05-07T20:32:37.5209319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5209543Z self=, 2025-05-07T20:32:37.5209620Z T=16384, 2025-05-07T20:32:37.5209693Z D=5120, 2025-05-07T20:32:37.5209775Z scale_ub=1200.0, 2025-05-07T20:32:37.5209860Z contiguous=True, 2025-05-07T20:32:37.5209941Z compiled=False, 2025-05-07T20:32:37.5210015Z ) 2025-05-07T20:32:37.5210235Z self = 2025-05-07T20:32:37.5210415Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5210421Z 2025-05-07T20:32:37.5210495Z @given( 2025-05-07T20:32:37.5210612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5210709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5210827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5210945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5211061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5211132Z ) 2025-05-07T20:32:37.5211379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5211473Z def test_silu_mul_quant( 2025-05-07T20:32:37.5211547Z self, 2025-05-07T20:32:37.5211664Z T: int, 2025-05-07T20:32:37.5211741Z D: int, 2025-05-07T20:32:37.5211838Z scale_ub: Optional[float], 2025-05-07T20:32:37.5211928Z contiguous: bool, 2025-05-07T20:32:37.5212014Z compiled: bool, 2025-05-07T20:32:37.5212088Z ) -> None: 2025-05-07T20:32:37.5212183Z torch.manual_seed(2025) 2025-05-07T20:32:37.5212255Z 2025-05-07T20:32:37.5212425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5212499Z 2025-05-07T20:32:37.5212592Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5212717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5212807Z x = x_sign * x_clamp 2025-05-07T20:32:37.5212886Z x0 = x[:, :D] 2025-05-07T20:32:37.5212961Z x1 = x[:, D:] 2025-05-07T20:32:37.5213032Z 2025-05-07T20:32:37.5213113Z if contiguous: 2025-05-07T20:32:37.5213201Z x0 = x0.contiguous() 2025-05-07T20:32:37.5213294Z x1 = x1.contiguous() 2025-05-07T20:32:37.5213362Z 2025-05-07T20:32:37.5213451Z if scale_ub is not None: 2025-05-07T20:32:37.5213558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5213695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5213810Z ) 2025-05-07T20:32:37.5213888Z else: 2025-05-07T20:32:37.5213979Z scale_ub_tensor = None 2025-05-07T20:32:37.5214051Z 2025-05-07T20:32:37.5214181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5214308Z op = silu_mul_quant 2025-05-07T20:32:37.5214394Z if compiled: 2025-05-07T20:32:37.5214492Z op = torch.compile(op) 2025-05-07T20:32:37.5214596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5214667Z 2025-05-07T20:32:37.5214756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5214761Z 2025-05-07T20:32:37.5214856Z moe/activation_test.py:117: 2025-05-07T20:32:37.5214990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5215091Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5215191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5215740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.5215836Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5216198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5216424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5216767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5216860Z kernel = self.compile( 2025-05-07T20:32:37.5217247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5217425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5217545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5217553Z 2025-05-07T20:32:37.5217763Z self = 2025-05-07T20:32:37.5218637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5219152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7acdd7ac0>} 2025-05-07T20:32:37.5219955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5220149Z context = 2025-05-07T20:32:37.5220157Z 2025-05-07T20:32:37.5220325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5220595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5220701Z module_map=module_map) 2025-05-07T20:32:37.5220867Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5220967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5221040Z E ^ 2025-05-07T20:32:37.5221402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5221406Z 2025-05-07T20:32:37.5221821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5221825Z 2025-05-07T20:32:37.5221933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5222156Z self=, 2025-05-07T20:32:37.5222231Z T=1, 2025-05-07T20:32:37.5222308Z D=7168, 2025-05-07T20:32:37.5222430Z scale_ub=1200.0, 2025-05-07T20:32:37.5222517Z contiguous=False, 2025-05-07T20:32:37.5222601Z compiled=False, 2025-05-07T20:32:37.5222670Z ) 2025-05-07T20:32:37.5222887Z self = 2025-05-07T20:32:37.5223097Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.5223102Z 2025-05-07T20:32:37.5223173Z @given( 2025-05-07T20:32:37.5223296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5223394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5223510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5223634Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5223748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5223817Z ) 2025-05-07T20:32:37.5224065Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5224196Z def test_silu_mul_quant( 2025-05-07T20:32:37.5224276Z self, 2025-05-07T20:32:37.5224350Z T: int, 2025-05-07T20:32:37.5224426Z D: int, 2025-05-07T20:32:37.5224525Z scale_ub: Optional[float], 2025-05-07T20:32:37.5224612Z contiguous: bool, 2025-05-07T20:32:37.5224697Z compiled: bool, 2025-05-07T20:32:37.5224773Z ) -> None: 2025-05-07T20:32:37.5224865Z torch.manual_seed(2025) 2025-05-07T20:32:37.5224934Z 2025-05-07T20:32:37.5225106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5225176Z 2025-05-07T20:32:37.5225264Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5225390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5225479Z x = x_sign * x_clamp 2025-05-07T20:32:37.5225555Z x0 = x[:, :D] 2025-05-07T20:32:37.5225638Z x1 = x[:, D:] 2025-05-07T20:32:37.5225709Z 2025-05-07T20:32:37.5225791Z if contiguous: 2025-05-07T20:32:37.5225879Z x0 = x0.contiguous() 2025-05-07T20:32:37.5225966Z x1 = x1.contiguous() 2025-05-07T20:32:37.5226039Z 2025-05-07T20:32:37.5226132Z if scale_ub is not None: 2025-05-07T20:32:37.5226236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5226376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5226449Z ) 2025-05-07T20:32:37.5226521Z else: 2025-05-07T20:32:37.5226619Z scale_ub_tensor = None 2025-05-07T20:32:37.5226687Z 2025-05-07T20:32:37.5226819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5226906Z op = silu_mul_quant 2025-05-07T20:32:37.5226987Z if compiled: 2025-05-07T20:32:37.5227137Z op = torch.compile(op) 2025-05-07T20:32:37.5227244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5227313Z 2025-05-07T20:32:37.5227406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5227411Z 2025-05-07T20:32:37.5227509Z moe/activation_test.py:117: 2025-05-07T20:32:37.5227636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5227735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5227833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5228342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5228436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5228796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5229025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5229370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5229468Z kernel = self.compile( 2025-05-07T20:32:37.5229896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5230072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5230198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5230241Z 2025-05-07T20:32:37.5230447Z self = 2025-05-07T20:32:37.5231238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5231751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac9744c0>} 2025-05-07T20:32:37.5232571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5232772Z context = 2025-05-07T20:32:37.5232776Z 2025-05-07T20:32:37.5232944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5233217Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5233325Z module_map=module_map) 2025-05-07T20:32:37.5233486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5233588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5233662Z E ^ 2025-05-07T20:32:37.5234023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5234031Z 2025-05-07T20:32:37.5234449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.5234454Z 2025-05-07T20:32:37.5234558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5234785Z self=, 2025-05-07T20:32:37.5234862Z T=4096, 2025-05-07T20:32:37.5234935Z D=7168, 2025-05-07T20:32:37.5235021Z scale_ub=1200.0, 2025-05-07T20:32:37.5235105Z contiguous=False, 2025-05-07T20:32:37.5235190Z compiled=True, 2025-05-07T20:32:37.5235260Z ) 2025-05-07T20:32:37.5235479Z self = 2025-05-07T20:32:37.5235659Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.5235664Z 2025-05-07T20:32:37.5235781Z @given( 2025-05-07T20:32:37.5235904Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5236006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5236124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5236243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5236362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5236433Z ) 2025-05-07T20:32:37.5236684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5236780Z def test_silu_mul_quant( 2025-05-07T20:32:37.5236854Z self, 2025-05-07T20:32:37.5236931Z T: int, 2025-05-07T20:32:37.5237005Z D: int, 2025-05-07T20:32:37.5237104Z scale_ub: Optional[float], 2025-05-07T20:32:37.5237196Z contiguous: bool, 2025-05-07T20:32:37.5237282Z compiled: bool, 2025-05-07T20:32:37.5237358Z ) -> None: 2025-05-07T20:32:37.5237460Z torch.manual_seed(2025) 2025-05-07T20:32:37.5237532Z 2025-05-07T20:32:37.5237702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5237777Z 2025-05-07T20:32:37.5237868Z x_sign = torch.sign(x) 2025-05-07T20:32:37.5238040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.5238129Z x = x_sign * x_clamp 2025-05-07T20:32:37.5238208Z x0 = x[:, :D] 2025-05-07T20:32:37.5238290Z x1 = x[:, D:] 2025-05-07T20:32:37.5238401Z 2025-05-07T20:32:37.5238482Z if contiguous: 2025-05-07T20:32:37.5238576Z x0 = x0.contiguous() 2025-05-07T20:32:37.5238664Z x1 = x1.contiguous() 2025-05-07T20:32:37.5238734Z 2025-05-07T20:32:37.5238828Z if scale_ub is not None: 2025-05-07T20:32:37.5238932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.5239065Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.5239140Z ) 2025-05-07T20:32:37.5239217Z else: 2025-05-07T20:32:37.5239310Z scale_ub_tensor = None 2025-05-07T20:32:37.5239385Z 2025-05-07T20:32:37.5239516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.5239648Z op = silu_mul_quant 2025-05-07T20:32:37.5239734Z if compiled: 2025-05-07T20:32:37.5239832Z op = torch.compile(op) 2025-05-07T20:32:37.5239940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5240010Z 2025-05-07T20:32:37.5240102Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.5240107Z 2025-05-07T20:32:37.5240208Z moe/activation_test.py:117: 2025-05-07T20:32:37.5240333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5240433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.5240536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.5240912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.5241006Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.5241505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.5241610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.5241972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.5242194Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.5242544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.5242639Z kernel = self.compile( 2025-05-07T20:32:37.5243024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.5243203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5243371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.5243377Z 2025-05-07T20:32:37.5243585Z self = 2025-05-07T20:32:37.5244379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.5244888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa7ac9751b0>} 2025-05-07T20:32:37.5245647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.5245842Z context = 2025-05-07T20:32:37.5245847Z 2025-05-07T20:32:37.5246016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.5246283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5246434Z module_map=module_map) 2025-05-07T20:32:37.5246600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5246699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5246774Z E ^ 2025-05-07T20:32:37.5247173Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:37.5247705Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5261431Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)  [CompilationError: fp8e4nv unsupported]
[Both examples fail with the identical error and traceback shown above; the repeated source listing and traceback are omitted.]
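For orientation, this is a rough eager-mode reference of what the op appears to compute, inferred only from the test's call shape (x0, x1, optional scale_ub tensor in; (y_fp8, y_scale) out) and the kernel name; the rowwise scaling scheme below is an assumption, not FBGEMM's documented algorithm:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # fp8e4nv, in Triton's terms

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, then rowwise quantization to FP8 (assumed scheme).
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_MAX).clamp_min(1e-12)
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale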
2025-05-07T20:32:37.5274578Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[Repeated source listing omitted; this example fails while building its inputs, before reaching the kernel.]
2025-05-07T20:32:37.5278002Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.5279894Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:37.5280059Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:37.5280172Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  [OutOfMemoryError: 112.00 MiB at moe/activation_test.py:95]
2025-05-07T20:32:37.5285656Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)  [OutOfMemoryError: 448.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5294112Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:95]
2025-05-07T20:32:37.5299709Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:94]
[Repeated listings and allocator messages omitted; each message carries the same PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion shown above.]
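The allocation sizes in these OutOfMemoryError messages match the test's own tensors exactly: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.sign, torch.abs, and torch.clamp each materialize another tensor of that size. A quick check (standalone snippet, not part of the test file):

    def tensor_mib(T: int, D: int) -> float:
        # Size of one [T, 2*D] bfloat16 tensor (2 bytes per element), in MiB.
        return T * 2 * D * 2 / 2**20

    assert tensor_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB" at x_clamp
    assert tensor_mib(4096, 7168) == 112.0   # 112.00 MiB at x_clamp
    assert tensor_mib(16384, 7168) == 448.0  # 448.00 MiB at torch.randn
    assert tensor_mib(2048, 7168) == 56.0    # 56.00 MiB at x_sign / x_clamp

Each failing example therefore needs only tens to hundreds of MiB; the card runs out because the messages report roughly 21.6 GiB already allocated by PyTorch, which points at memory accumulating across Hypothesis draws rather than any single draw being too large.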
2025-05-07T20:32:37.5305033Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5317631Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5330335Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
[Identical listings and tracebacks omitted. Note these examples run with compiled=False, so the error comes from the direct Triton kernel launch in silu_mul_quant, not from torch.compile.]
2025-05-07T20:32:37.5342954Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)  [OutOfMemoryError: 56.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5348202Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  [CompilationError: fp8e4nv unsupported]
2025-05-07T20:32:37.5361284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 40.00 MiB at moe/activation_test.py:94]
2025-05-07T20:32:37.5366756Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 320.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5371962Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)  [OutOfMemoryError: 80.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5377175Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)  [OutOfMemoryError: 40.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5382529Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)  [OutOfMemoryError: 112.00 MiB at moe/activation_test.py:92]
2025-05-07T20:32:37.5387697Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)  [OutOfMemoryError: 40.00 MiB at the torch.randn line; the excerpt ends inside this block]
[Repeated listings, tracebacks, and allocator messages omitted for these eight examples.]
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5392752Z 2025-05-07T20:32:37.5392872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5392877Z 2025-05-07T20:32:37.5392978Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5393205Z self=, 2025-05-07T20:32:37.5393282Z T=4096, 2025-05-07T20:32:37.5393355Z D=7168, 2025-05-07T20:32:37.5393435Z scale_ub=1200.0, 2025-05-07T20:32:37.5393522Z contiguous=True, 2025-05-07T20:32:37.5393605Z compiled=False, 2025-05-07T20:32:37.5393676Z ) 2025-05-07T20:32:37.5393893Z self = 2025-05-07T20:32:37.5394111Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5394116Z 2025-05-07T20:32:37.5394195Z @given( 2025-05-07T20:32:37.5394315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5394413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5394532Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5394648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5394761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5394837Z ) 2025-05-07T20:32:37.5395081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5395175Z def test_silu_mul_quant( 2025-05-07T20:32:37.5395252Z self, 2025-05-07T20:32:37.5395326Z T: int, 2025-05-07T20:32:37.5395403Z D: int, 2025-05-07T20:32:37.5395499Z scale_ub: Optional[float], 2025-05-07T20:32:37.5395587Z contiguous: bool, 2025-05-07T20:32:37.5395675Z compiled: bool, 2025-05-07T20:32:37.5395751Z ) -> None: 2025-05-07T20:32:37.5395844Z torch.manual_seed(2025) 2025-05-07T20:32:37.5395918Z 2025-05-07T20:32:37.5396088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5397952Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5397993Z 2025-05-07T20:32:37.5398109Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5398114Z 2025-05-07T20:32:37.5398217Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5398444Z self=, 2025-05-07T20:32:37.5398558Z T=16384, 2025-05-07T20:32:37.5398637Z D=7168, 2025-05-07T20:32:37.5398719Z scale_ub=None, 2025-05-07T20:32:37.5398805Z contiguous=False, 2025-05-07T20:32:37.5398889Z compiled=True, 2025-05-07T20:32:37.5398959Z ) 2025-05-07T20:32:37.5399177Z self = 2025-05-07T20:32:37.5399358Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.5399362Z 2025-05-07T20:32:37.5399438Z @given( 2025-05-07T20:32:37.5399553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5399655Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5399767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5399887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5400005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5400077Z ) 2025-05-07T20:32:37.5400324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5400420Z def test_silu_mul_quant( 2025-05-07T20:32:37.5400496Z self, 2025-05-07T20:32:37.5400578Z T: int, 2025-05-07T20:32:37.5400649Z D: int, 2025-05-07T20:32:37.5400746Z scale_ub: Optional[float], 2025-05-07T20:32:37.5400836Z contiguous: bool, 2025-05-07T20:32:37.5400920Z compiled: bool, 2025-05-07T20:32:37.5400995Z ) -> None: 2025-05-07T20:32:37.5401090Z torch.manual_seed(2025) 2025-05-07T20:32:37.5401159Z 2025-05-07T20:32:37.5401331Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5403196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5403205Z 2025-05-07T20:32:37.5403325Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5403333Z 2025-05-07T20:32:37.5403435Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5403657Z self=, 2025-05-07T20:32:37.5403735Z T=4096, 2025-05-07T20:32:37.5403807Z D=7168, 2025-05-07T20:32:37.5403885Z scale_ub=None, 2025-05-07T20:32:37.5403969Z contiguous=True, 2025-05-07T20:32:37.5404050Z compiled=False, 2025-05-07T20:32:37.5404118Z ) 2025-05-07T20:32:37.5404337Z self = 2025-05-07T20:32:37.5404504Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5404511Z 2025-05-07T20:32:37.5404587Z @given( 2025-05-07T20:32:37.5404742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5404842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5404957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5405072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5405224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5405296Z ) 2025-05-07T20:32:37.5405539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5405630Z def test_silu_mul_quant( 2025-05-07T20:32:37.5405705Z self, 2025-05-07T20:32:37.5405777Z T: int, 2025-05-07T20:32:37.5405853Z D: int, 2025-05-07T20:32:37.5405948Z scale_ub: Optional[float], 2025-05-07T20:32:37.5406038Z contiguous: bool, 2025-05-07T20:32:37.5406123Z compiled: bool, 2025-05-07T20:32:37.5406198Z ) -> None: 2025-05-07T20:32:37.5406330Z torch.manual_seed(2025) 2025-05-07T20:32:37.5406402Z 2025-05-07T20:32:37.5406572Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5408391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5408400Z 2025-05-07T20:32:37.5408517Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5408522Z 2025-05-07T20:32:37.5408622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5408847Z self=, 2025-05-07T20:32:37.5408926Z T=16384, 2025-05-07T20:32:37.5409003Z D=7168, 2025-05-07T20:32:37.5409081Z scale_ub=None, 2025-05-07T20:32:37.5409163Z contiguous=True, 2025-05-07T20:32:37.5409246Z compiled=False, 2025-05-07T20:32:37.5409315Z ) 2025-05-07T20:32:37.5409532Z self = 2025-05-07T20:32:37.5409706Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.5409711Z 2025-05-07T20:32:37.5409784Z @given( 2025-05-07T20:32:37.5409899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5409998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5410115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5410275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5410388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5410461Z ) 2025-05-07T20:32:37.5410707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5410801Z def test_silu_mul_quant( 2025-05-07T20:32:37.5410874Z self, 2025-05-07T20:32:37.5410948Z T: int, 2025-05-07T20:32:37.5411021Z D: int, 2025-05-07T20:32:37.5411116Z scale_ub: Optional[float], 2025-05-07T20:32:37.5411208Z contiguous: bool, 2025-05-07T20:32:37.5411289Z compiled: bool, 2025-05-07T20:32:37.5411363Z ) -> None: 2025-05-07T20:32:37.5411458Z torch.manual_seed(2025) 2025-05-07T20:32:37.5411527Z 2025-05-07T20:32:37.5411696Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5413555Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.5413564Z 2025-05-07T20:32:37.5413722Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.5413726Z 2025-05-07T20:32:37.5413827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.5414051Z self=, 2025-05-07T20:32:37.5414128Z T=16384, 2025-05-07T20:32:37.5414200Z D=7168, 2025-05-07T20:32:37.5414279Z scale_ub=1200.0, 2025-05-07T20:32:37.5414362Z contiguous=True, 2025-05-07T20:32:37.5414443Z compiled=False, 2025-05-07T20:32:37.5414519Z ) 2025-05-07T20:32:37.5418315Z self = 2025-05-07T20:32:37.5418502Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.5418581Z 2025-05-07T20:32:37.5418664Z @given( 2025-05-07T20:32:37.5418791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.5418888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.5419003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.5419123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.5419237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.5419307Z ) 2025-05-07T20:32:37.5419554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.5419649Z def test_silu_mul_quant( 2025-05-07T20:32:37.5419723Z self, 2025-05-07T20:32:37.5419797Z T: int, 2025-05-07T20:32:37.5419876Z D: int, 2025-05-07T20:32:37.5419971Z scale_ub: Optional[float], 2025-05-07T20:32:37.5420059Z contiguous: bool, 2025-05-07T20:32:37.5420148Z compiled: bool, 2025-05-07T20:32:37.5420225Z ) -> None: 2025-05-07T20:32:37.5420318Z torch.manual_seed(2025) 2025-05-07T20:32:37.5420392Z 2025-05-07T20:32:37.5420561Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.5422441Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
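Two things stand out in these failures. First, each failed allocation matches the requested tensor exactly: a [T, 2*D] bf16 tensor needs T x 2D x 2 bytes, e.g. 16384 x 14336 x 2 B = 448.00 MiB and 4096 x 10240 x 2 B = 80.00 MiB. So no single request is oversized; the 22.07 GiB card is simply already holding 22.04 GiB from earlier examples. Second, the error text itself names the standard mitigation. A minimal mitigation sketch along those lines; the helper below is hypothetical and not part of the test suite:

    import gc
    import os

    # Assumption: must be set before the first CUDA allocation in the process;
    # this is the setting the OOM message itself recommends when reserved-but-
    # unallocated memory is large.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical per-example cleanup: drop dead Python references first,
        # then return cached allocator blocks to the driver so the next
        # Hypothesis example starts from a mostly empty device.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling a helper like this between examples would not shrink the test's own tensors, but it would stop allocations from accumulating across the dozens of examples Hypothesis generates.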
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example allocates successfully, runs the full test body, and fails inside the kernel launch instead:

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = triton.compiler.compiler.ASTSource, options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
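The ValueError pins the root cause for this leg of the matrix: Triton only lowers the fp8e4nv (e4m3) type on GPUs of compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an NVIDIA A10G at capability (8, 6), where only fp8e4b15 and fp8e5 are available. A hedged guard sketch that one could put in front of fp8 tests; the helper name and the (8, 9) cutoff are assumptions drawn from the error text, not FBGEMM API:

    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv path needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch: skip fp8 kernels cleanly instead of failing in Triton codegen.
    requires_fp8 = pytest.mark.skipif(
        not cuda_supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+"
    )

With such a marker the job would report a skip on g5 runners rather than a hard failure deep inside ast_to_ttir.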
Trying example: test_silu_mul_quant(
    T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False,
)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True,
)
The compiled variant reaches the same kernel through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) and fails with the identical CompilationError at _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

The remaining examples of this session hit CUDA OOM again, now with only 4.44 MiB free on the device:

T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> 20.00 MiB at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): OutOfMemoryError
T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True  -> 20.00 MiB at moe/activation_test.py:95: OutOfMemoryError
T=128, D=7168, scale_ub=None,   contiguous=True, compiled=True  -> 20.00 MiB at moe/activation_test.py:92 (x = torch.randn(...)): OutOfMemoryError

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:32:37.5472455Z 2025-05-07T20:32:37.5472672Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:37.5472838Z ================= 1 failed, 1 deselected, 3 warnings in 17.47s ================= 2025-05-07T20:32:39.0833470Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:39.1454746Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:39.1455335Z 2025-05-07T20:32:41.1472244Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:43.2931412Z ============================= test session starts ============================== 2025-05-07T20:32:43.2932343Z platform linux -- Python 3.10.13, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:43.2932880Z cachedir: .pytest_cache 2025-05-07T20:32:43.2933472Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:43.2934332Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:43.2934751Z plugins: hypothesis-6.131.14 2025-05-07T20:32:44.8971797Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:45.0757223Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:45.0758266Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:45.0758544Z 2025-05-07T20:32:47.5954608Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5956113Z self=, 2025-05-07T20:32:47.5956682Z T=1, 2025-05-07T20:32:47.5956933Z D=5120, 2025-05-07T20:32:47.5957189Z scale_ub=None, 2025-05-07T20:32:47.5957441Z contiguous=True, 2025-05-07T20:32:47.5957684Z compiled=True, 2025-05-07T20:32:47.5957903Z ) 2025-05-07T20:32:47.5958230Z self = 2025-05-07T20:32:47.5958733Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.5959008Z 2025-05-07T20:32:47.5959089Z @given( 2025-05-07T20:32:47.5959337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5959653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5959978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5960322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5960657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5960952Z ) 2025-05-07T20:32:47.5961318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5961769Z def test_silu_mul_quant( 2025-05-07T20:32:47.5962013Z self, 2025-05-07T20:32:47.5962216Z T: int, 2025-05-07T20:32:47.5962421Z D: int, 2025-05-07T20:32:47.5962642Z scale_ub: Optional[float], 2025-05-07T20:32:47.5962927Z contiguous: bool, 2025-05-07T20:32:47.5963176Z compiled: bool, 2025-05-07T20:32:47.5963402Z ) -> None: 2025-05-07T20:32:47.5963625Z torch.manual_seed(2025) 2025-05-07T20:32:47.5963876Z 2025-05-07T20:32:47.5964157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5964508Z 2025-05-07T20:32:47.5964840Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5965140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:47.5954608Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    T=1, D=5120, scale_ub=None, contiguous=True, compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

On the fresh attempt this example allocates and, per pytest's failing-line marker, fn() itself returns; the failure moves into the test's reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(the autotuner benchmarks each config via triton/runtime/autotuner.py:186 -> _bench -> testing.py:117 do_bench, then jit.py:623: kernel = self.compile(...), compiler.py:273: module = src.make_ir(...))
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
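Notably, the failure has now moved out of the kernel under test and into the test's own baseline: triton_quantize_fp8_row is itself a Triton kernel, so on this GPU even the reference cannot compile. Assuming torch.float8_e4m3fn is available (PyTorch 2.1+), a rowwise fp8 reference can be written in plain PyTorch; this is an illustrative sketch under that assumption, not FBGEMM's kernel semantics:

    from typing import Optional, Tuple

    import torch

    # Largest finite value representable in float8_e4m3fn (448.0).
    E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Rowwise quantization sketch: y ~= y_fp8.float() * scale[:, None],
        # matching how the test reconstructs y from (y_fp8, y_scale).
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max.clamp(min=1e-12) / E4M3_MAX
        y_fp8 = (y.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

A Triton-free fallback like this would let the comparison logic run on pre-Ada GPUs, at the cost of not exercising the production quantization kernel.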
2025-05-07T20:32:48.9368354Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9368655Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9368980Z x = x_sign * x_clamp 2025-05-07T20:32:48.9369230Z x0 = x[:, :D] 2025-05-07T20:32:48.9369446Z x1 = x[:, D:] 2025-05-07T20:32:48.9369661Z 2025-05-07T20:32:48.9369854Z if contiguous: 2025-05-07T20:32:48.9370093Z x0 = x0.contiguous() 2025-05-07T20:32:48.9370361Z x1 = x1.contiguous() 2025-05-07T20:32:48.9370611Z 2025-05-07T20:32:48.9370803Z if scale_ub is not None: 2025-05-07T20:32:48.9371090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9371440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9371756Z ) 2025-05-07T20:32:48.9371963Z else: 2025-05-07T20:32:48.9372220Z scale_ub_tensor = None 2025-05-07T20:32:48.9372482Z 2025-05-07T20:32:48.9372718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9373044Z op = silu_mul_quant 2025-05-07T20:32:48.9373301Z if compiled: 2025-05-07T20:32:48.9373641Z op = torch.compile(op) 2025-05-07T20:32:48.9373950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9374230Z 2025-05-07T20:32:48.9374428Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.9374602Z 2025-05-07T20:32:48.9374786Z moe/activation_test.py:117: 2025-05-07T20:32:48.9375089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9375424Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.9375708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9376423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.9377141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.9377688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9378568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9379274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9386672Z kernel = self.compile( 2025-05-07T20:32:48.9387261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9387940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9388351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9388581Z 2025-05-07T20:32:48.9388802Z self = 2025-05-07T20:32:48.9389909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9391387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6e1219990>} 2025-05-07T20:32:48.9392798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9393844Z context = 2025-05-07T20:32:48.9394134Z 2025-05-07T20:32:48.9394313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9394844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9395391Z module_map=module_map) 2025-05-07T20:32:48.9395774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9396142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.9396404Z E ^ 2025-05-07T20:32:48.9396890Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9397356Z 2025-05-07T20:32:48.9397780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9398304Z 2025-05-07T20:32:48.9398418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9398837Z self=, 2025-05-07T20:32:48.9399245Z T=2048, 2025-05-07T20:32:48.9399440Z D=5120, 2025-05-07T20:32:48.9399633Z scale_ub=1200.0, 2025-05-07T20:32:48.9399866Z contiguous=True, 2025-05-07T20:32:48.9400094Z compiled=True, 2025-05-07T20:32:48.9400303Z ) 2025-05-07T20:32:48.9400631Z self = 2025-05-07T20:32:48.9401134Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:48.9401408Z 2025-05-07T20:32:48.9401546Z @given( 2025-05-07T20:32:48.9401779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9402125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9402467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9402843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9403183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9403476Z ) 2025-05-07T20:32:48.9403829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9404278Z def test_silu_mul_quant( 2025-05-07T20:32:48.9404527Z self, 2025-05-07T20:32:48.9404720Z T: int, 2025-05-07T20:32:48.9404927Z D: int, 2025-05-07T20:32:48.9405153Z scale_ub: Optional[float], 2025-05-07T20:32:48.9405432Z contiguous: bool, 2025-05-07T20:32:48.9405671Z compiled: bool, 2025-05-07T20:32:48.9405949Z ) -> None: 2025-05-07T20:32:48.9406170Z torch.manual_seed(2025) 2025-05-07T20:32:48.9406413Z 2025-05-07T20:32:48.9406696Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9407044Z 2025-05-07T20:32:48.9407238Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9407543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9407861Z x = x_sign * x_clamp 2025-05-07T20:32:48.9408099Z x0 = x[:, :D] 2025-05-07T20:32:48.9408320Z x1 = x[:, D:] 2025-05-07T20:32:48.9408532Z 2025-05-07T20:32:48.9408719Z if contiguous: 2025-05-07T20:32:48.9408960Z x0 = x0.contiguous() 2025-05-07T20:32:48.9409223Z x1 = x1.contiguous() 2025-05-07T20:32:48.9409460Z 2025-05-07T20:32:48.9409660Z if scale_ub is not None: 2025-05-07T20:32:48.9409931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9410272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9410585Z ) 2025-05-07T20:32:48.9410781Z else: 2025-05-07T20:32:48.9410994Z scale_ub_tensor = None 2025-05-07T20:32:48.9411250Z 2025-05-07T20:32:48.9411491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9411802Z op = silu_mul_quant 2025-05-07T20:32:48.9412056Z if compiled: 
2025-05-07T20:32:48.9412307Z                 op = torch.compile(op)
2025-05-07T20:32:48.9412602Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:48.9412881Z 
2025-05-07T20:32:48.9413079Z         y_fp8, y_scale = fn()
2025-05-07T20:32:48.9413365Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:48.9413654Z 
2025-05-07T20:32:48.9413952Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:48.9414288Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:48.9414586Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:48.9414910Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:48.9415277Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:48.9415584Z 
2025-05-07T20:32:48.9415794Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:48.9415994Z 
2025-05-07T20:32:48.9416102Z moe/activation_test.py:126: 
2025-05-07T20:32:48.9416398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:48.9416735Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:48.9417071Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:48.9417867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:48.9418710Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:48.9419272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:48.9419970Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:48.9420714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:48.9421452Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:48.9422311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:48.9423069Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:48.9423803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:48.9424454Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:48.9425063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:48.9425627Z     fn()
2025-05-07T20:32:48.9426142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:48.9426727Z     self.fn.run(
2025-05-07T20:32:48.9427207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:48.9427744Z     kernel = self.compile(
2025-05-07T20:32:48.9428299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:48.9428967Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:48.9429366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:48.9429592Z 
2025-05-07T20:32:48.9429805Z self = <...>
2025-05-07T20:32:48.9430909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:48.9432360Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6dbc2d3f0>}
2025-05-07T20:32:48.9433729Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:48.9434767Z context = <...>
2025-05-07T20:32:48.9435066Z 
2025-05-07T20:32:48.9435238Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:48.9435816Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:48.9436299Z                            module_map=module_map)
2025-05-07T20:32:48.9436668Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.9437038Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:48.9437307Z E       ^
2025-05-07T20:32:48.9437781Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.9438247Z 
2025-05-07T20:32:48.9438672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.9439196Z 
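Note on the failure: this is a hardware capability gap, not a logic bug in the test. Triton's "fp8e4nv" is the NVIDIA e4m3 FP8 format, which Triton only compiles for compute capability 8.9 and newer (Ada/Hopper); the A10G GPUs behind linux.g5.4xlarge are SM 8.6, which only exposes 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. A minimal sketch of a guard that would skip these cases on unsupported GPUs follows; supports_fp8e4nv is a hypothetical helper, not an existing FBGEMM API.

    # Sketch: gate fp8 e4m3 tests on GPU capability. Assumption: e4m3/fp8e4nv
    # requires NVIDIA compute capability >= 8.9; the helper name is hypothetical.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # SM 8.9+ (e.g. L4, L40S, H100)

    # Applied to the failing test, e.g.:
    # @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv needs SM 8.9+")
    # def test_silu_mul_quant(...): ...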
torch.Tensor]: 2025-05-07T20:32:50.1473254Z op = silu_mul_quant 2025-05-07T20:32:50.1473513Z if compiled: 2025-05-07T20:32:50.1473768Z op = torch.compile(op) 2025-05-07T20:32:50.1474079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1474360Z 2025-05-07T20:32:50.1474560Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1474734Z 2025-05-07T20:32:50.1474932Z moe/activation_test.py:117: 2025-05-07T20:32:50.1475239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1475579Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1475872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1476590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1477301Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1477852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1478550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1479229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1479775Z kernel = self.compile( 2025-05-07T20:32:50.1480333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1481029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1481438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1481675Z 2025-05-07T20:32:50.1481935Z self = 2025-05-07T20:32:50.1483052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1484521Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dbc2ce50>} 2025-05-07T20:32:50.1485892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1486945Z context = 2025-05-07T20:32:50.1487285Z 2025-05-07T20:32:50.1487460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1487999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1488480Z module_map=module_map) 2025-05-07T20:32:50.1488861Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1489233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1489497Z E ^ 2025-05-07T20:32:50.1489983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1490450Z 2025-05-07T20:32:50.1490885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1491405Z 2025-05-07T20:32:50.1491519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1491939Z self=, 2025-05-07T20:32:50.1492354Z T=1, 2025-05-07T20:32:50.1492552Z D=7168, 2025-05-07T20:32:50.1492749Z scale_ub=None, 2025-05-07T20:32:50.1492978Z contiguous=True, 2025-05-07T20:32:50.1493218Z compiled=True, 2025-05-07T20:32:50.1493430Z ) 2025-05-07T20:32:50.1493760Z self = 2025-05-07T20:32:50.1494255Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.1494518Z 2025-05-07T20:32:50.1494604Z @given( 2025-05-07T20:32:50.1494840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1495164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1495479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1495856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1496197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1496491Z ) 2025-05-07T20:32:50.1496845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1497298Z def test_silu_mul_quant( 2025-05-07T20:32:50.1497551Z self, 2025-05-07T20:32:50.1497753Z T: int, 2025-05-07T20:32:50.1497950Z D: int, 2025-05-07T20:32:50.1498290Z scale_ub: Optional[float], 2025-05-07T20:32:50.1498571Z contiguous: bool, 2025-05-07T20:32:50.1498812Z compiled: bool, 2025-05-07T20:32:50.1499043Z ) -> None: 2025-05-07T20:32:50.1499269Z torch.manual_seed(2025) 2025-05-07T20:32:50.1499513Z 2025-05-07T20:32:50.1499796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1500147Z 2025-05-07T20:32:50.1500343Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1500648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1500966Z x = x_sign * x_clamp 2025-05-07T20:32:50.1501209Z x0 = x[:, :D] 2025-05-07T20:32:50.1501439Z x1 = x[:, D:] 2025-05-07T20:32:50.1501680Z 2025-05-07T20:32:50.1501895Z if contiguous: 2025-05-07T20:32:50.1502181Z x0 = x0.contiguous() 2025-05-07T20:32:50.1502450Z x1 = x1.contiguous() 2025-05-07T20:32:50.1502691Z 2025-05-07T20:32:50.1502894Z if scale_ub is not None: 2025-05-07T20:32:50.1503223Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1503570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1503880Z ) 2025-05-07T20:32:50.1504081Z else: 2025-05-07T20:32:50.1504300Z scale_ub_tensor = None 2025-05-07T20:32:50.1504554Z 2025-05-07T20:32:50.1504794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1505117Z op = silu_mul_quant 2025-05-07T20:32:50.1505373Z if compiled: 2025-05-07T20:32:50.1505635Z op = torch.compile(op) 2025-05-07T20:32:50.1505941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1506259Z 2025-05-07T20:32:50.1506459Z y_fp8, y_scale = fn() 2025-05-07T20:32:50.1506758Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:50.1507046Z 2025-05-07T20:32:50.1507294Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1507640Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:50.1507944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:50.1508262Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:50.1508630Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.1508945Z 2025-05-07T20:32:50.1509149Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:50.1509368Z 2025-05-07T20:32:50.1509471Z moe/activation_test.py:126: 2025-05-07T20:32:50.1509777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1510113Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:50.1510449Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:50.1511252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:50.1512073Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:50.1512630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1513320Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1514023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:50.1514760Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.1515569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:50.1516334Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:50.1517081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:50.1517735Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:50.1518344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:50.1518876Z fn() 2025-05-07T20:32:50.1519401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:50.1519992Z self.fn.run( 2025-05-07T20:32:50.1520467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1521015Z kernel = self.compile( 2025-05-07T20:32:50.1521577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1522243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1522692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1522922Z 2025-05-07T20:32:50.1523142Z self = 2025-05-07T20:32:50.1524285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1525675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6db9c09d0>} 2025-05-07T20:32:50.1527047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1528137Z context = 2025-05-07T20:32:50.1528431Z 2025-05-07T20:32:50.1528611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1529140Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1529630Z module_map=module_map) 2025-05-07T20:32:50.1530007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1530379Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:50.1530653Z E ^ 2025-05-07T20:32:50.1531129Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1531588Z 2025-05-07T20:32:50.1532020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1532544Z 2025-05-07T20:32:50.1532659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1533080Z self=, 2025-05-07T20:32:50.1533492Z T=4096, 2025-05-07T20:32:50.1533690Z D=5120, 2025-05-07T20:32:50.1533886Z scale_ub=None, 2025-05-07T20:32:50.1534114Z contiguous=False, 2025-05-07T20:32:50.1534356Z compiled=False, 2025-05-07T20:32:50.1534565Z ) 2025-05-07T20:32:51.7231371Z self = 2025-05-07T20:32:51.7231951Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.7232250Z 2025-05-07T20:32:51.7232338Z @given( 2025-05-07T20:32:51.7232767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7233673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7234308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7234993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7235664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7236247Z ) 2025-05-07T20:32:51.7236974Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7237885Z def test_silu_mul_quant( 2025-05-07T20:32:51.7238374Z self, 2025-05-07T20:32:51.7238783Z T: int, 2025-05-07T20:32:51.7239187Z D: int, 2025-05-07T20:32:51.7239627Z scale_ub: Optional[float], 2025-05-07T20:32:51.7240183Z contiguous: bool, 2025-05-07T20:32:51.7240671Z compiled: bool, 2025-05-07T20:32:51.7241124Z ) -> None: 2025-05-07T20:32:51.7241568Z torch.manual_seed(2025) 2025-05-07T20:32:51.7242060Z 2025-05-07T20:32:51.7242618Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7243039Z 2025-05-07T20:32:51.7243253Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7243552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7243875Z x = x_sign * x_clamp 2025-05-07T20:32:51.7244126Z x0 = x[:, :D] 2025-05-07T20:32:51.7244429Z x1 = x[:, D:] 2025-05-07T20:32:51.7244652Z 2025-05-07T20:32:51.7244850Z if contiguous: 2025-05-07T20:32:51.7245087Z x0 = x0.contiguous() 2025-05-07T20:32:51.7245358Z x1 = x1.contiguous() 2025-05-07T20:32:51.7245692Z 2025-05-07T20:32:51.7245895Z if scale_ub is not None: 2025-05-07T20:32:51.7246175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7246523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7246840Z ) 2025-05-07T20:32:51.7247037Z else: 2025-05-07T20:32:51.7247257Z scale_ub_tensor = None 2025-05-07T20:32:51.7247517Z 2025-05-07T20:32:51.7247761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7248089Z op = silu_mul_quant 2025-05-07T20:32:51.7248348Z if compiled: 
2025-05-07T20:32:51.7248683Z op = torch.compile(op) 2025-05-07T20:32:51.7248998Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7249284Z 2025-05-07T20:32:51.7249485Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7249660Z 2025-05-07T20:32:51.7249767Z moe/activation_test.py:117: 2025-05-07T20:32:51.7250075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7250417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7250706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7251421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7252136Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7252690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7253392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7254084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7254637Z kernel = self.compile( 2025-05-07T20:32:51.7255195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7256105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7256516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7256745Z 2025-05-07T20:32:51.7256967Z self = 2025-05-07T20:32:51.7264440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7265910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db9c1a20>} 2025-05-07T20:32:51.7267294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7268349Z context = 2025-05-07T20:32:51.7268645Z 2025-05-07T20:32:51.7268824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7269352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7269831Z module_map=module_map) 2025-05-07T20:32:51.7270212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7270574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7270832Z E ^ 2025-05-07T20:32:51.7271381Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7271843Z 2025-05-07T20:32:51.7272276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7272798Z 2025-05-07T20:32:51.7272974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7273395Z self=, 2025-05-07T20:32:51.7273807Z T=4096, 2025-05-07T20:32:51.7274002Z D=7168, 2025-05-07T20:32:51.7274195Z scale_ub=None, 2025-05-07T20:32:51.7274424Z contiguous=False, 2025-05-07T20:32:51.7274663Z compiled=False, 2025-05-07T20:32:51.7274873Z ) 2025-05-07T20:32:51.7275209Z self = 2025-05-07T20:32:51.7275740Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.7276083Z 2025-05-07T20:32:51.7276170Z @given( 2025-05-07T20:32:51.7276402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7276728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7277048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7277387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7277721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7278019Z ) 2025-05-07T20:32:51.7278379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7278822Z def test_silu_mul_quant( 2025-05-07T20:32:51.7279073Z self, 2025-05-07T20:32:51.7279277Z T: int, 2025-05-07T20:32:51.7279476Z D: int, 2025-05-07T20:32:51.7279705Z scale_ub: Optional[float], 2025-05-07T20:32:51.7279992Z contiguous: bool, 2025-05-07T20:32:51.7280230Z compiled: bool, 2025-05-07T20:32:51.7280465Z ) -> None: 2025-05-07T20:32:51.7280689Z torch.manual_seed(2025) 2025-05-07T20:32:51.7280935Z 2025-05-07T20:32:51.7281221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7281569Z 2025-05-07T20:32:51.7281764Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7282066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7282384Z x = x_sign * x_clamp 2025-05-07T20:32:51.7282630Z x0 = x[:, :D] 2025-05-07T20:32:51.7282846Z x1 = x[:, D:] 2025-05-07T20:32:51.7283060Z 2025-05-07T20:32:51.7283257Z if contiguous: 2025-05-07T20:32:51.7283487Z x0 = x0.contiguous() 2025-05-07T20:32:51.7283752Z x1 = x1.contiguous() 2025-05-07T20:32:51.7283997Z 2025-05-07T20:32:51.7284189Z if scale_ub is not None: 2025-05-07T20:32:51.7284522Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7284869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7285177Z ) 2025-05-07T20:32:51.7285379Z else: 2025-05-07T20:32:51.7285586Z scale_ub_tensor = None 2025-05-07T20:32:51.7285858Z 2025-05-07T20:32:51.7286097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7286409Z op = silu_mul_quant 2025-05-07T20:32:51.7286660Z if compiled: 2025-05-07T20:32:51.7286914Z op = torch.compile(op) 2025-05-07T20:32:51.7287208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7287483Z 2025-05-07T20:32:51.7287679Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.7287846Z 2025-05-07T20:32:51.7287956Z moe/activation_test.py:117: 2025-05-07T20:32:51.7288249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7288582Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.7288876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7289575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.7290278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.7290870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7291568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7292274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7292868Z kernel = self.compile( 2025-05-07T20:32:51.7293423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7294091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7294499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7294734Z 2025-05-07T20:32:51.7294950Z self = 2025-05-07T20:32:51.7296094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7297487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db9c2560>} 2025-05-07T20:32:51.7298952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7299995Z context = 2025-05-07T20:32:51.7300297Z 2025-05-07T20:32:51.7300467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7300996Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7301470Z module_map=module_map) 2025-05-07T20:32:51.7301844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7302209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.7302469Z E ^ 2025-05-07T20:32:51.7302983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7303463Z 2025-05-07T20:32:51.7303882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7304399Z 2025-05-07T20:32:51.7304510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7304970Z self=, 2025-05-07T20:32:51.7305375Z T=128, 2025-05-07T20:32:51.7305566Z D=7168, 2025-05-07T20:32:51.7305755Z scale_ub=None, 2025-05-07T20:32:51.7305981Z contiguous=False, 2025-05-07T20:32:51.7306208Z compiled=True, 2025-05-07T20:32:51.7306408Z ) 2025-05-07T20:32:51.7922216Z self = 2025-05-07T20:32:51.7922784Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.7923163Z 2025-05-07T20:32:51.7923286Z @given( 2025-05-07T20:32:51.7923615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7924042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7924450Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7924852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7925198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7925489Z ) 2025-05-07T20:32:51.7925861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7926319Z def test_silu_mul_quant( 2025-05-07T20:32:51.7926572Z self, 2025-05-07T20:32:51.7926777Z T: int, 2025-05-07T20:32:51.7926991Z D: int, 2025-05-07T20:32:51.7927344Z scale_ub: Optional[float], 2025-05-07T20:32:51.7927636Z contiguous: bool, 2025-05-07T20:32:51.7927889Z compiled: bool, 2025-05-07T20:32:51.7928130Z ) -> None: 2025-05-07T20:32:51.7928355Z torch.manual_seed(2025) 2025-05-07T20:32:51.7928678Z 2025-05-07T20:32:51.7928970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7929326Z 2025-05-07T20:32:51.7929539Z x_sign = torch.sign(x) 2025-05-07T20:32:51.7929853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.7930206Z x = x_sign * x_clamp 2025-05-07T20:32:51.7930456Z x0 = x[:, :D] 2025-05-07T20:32:51.7930692Z x1 = x[:, D:] 2025-05-07T20:32:51.7930922Z 2025-05-07T20:32:51.7931117Z if contiguous: 2025-05-07T20:32:51.7931368Z x0 = x0.contiguous() 2025-05-07T20:32:51.7931641Z x1 = x1.contiguous() 2025-05-07T20:32:51.7931959Z 2025-05-07T20:32:51.7932165Z if scale_ub is not None: 2025-05-07T20:32:51.7932461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.7932868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.7933184Z ) 2025-05-07T20:32:51.7933394Z else: 2025-05-07T20:32:51.7933618Z scale_ub_tensor = None 2025-05-07T20:32:51.7933875Z 2025-05-07T20:32:51.7934121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7934449Z op = silu_mul_quant 2025-05-07T20:32:51.7934705Z if compiled: 2025-05-07T20:32:51.7934968Z op = torch.compile(op) 2025-05-07T20:32:51.7935279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.7935562Z 2025-05-07T20:32:51.7935773Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.7936074Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.7936373Z 2025-05-07T20:32:51.7936632Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.7936984Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.7937289Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.7937611Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.7937981Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.7938459Z 2025-05-07T20:32:51.7938671Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.7938879Z 2025-05-07T20:32:51.7938986Z moe/activation_test.py:126: 2025-05-07T20:32:51.7939295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7939639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.7940059Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.7940877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.7941657Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.7942216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.7942923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.7943637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.7944379Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.7945146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:51.7945916Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.7946667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.7947323Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.7947994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.7948531Z fn() 2025-05-07T20:32:51.7949060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.7949717Z self.fn.run( 2025-05-07T20:32:51.7950200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.7950752Z kernel = self.compile( 2025-05-07T20:32:51.7951304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.7951978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.7952397Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.7952709Z 2025-05-07T20:32:51.7952933Z self = 2025-05-07T20:32:51.7954038Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.7955442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6db9ee3b0>} 2025-05-07T20:32:51.7957244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.7958300Z context = 2025-05-07T20:32:51.7958595Z 2025-05-07T20:32:51.7958780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.7959316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.7959803Z module_map=module_map) 2025-05-07T20:32:51.7960183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.7960557Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.7960830Z E ^ 2025-05-07T20:32:51.7961309Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.7961766Z 2025-05-07T20:32:51.7962200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.7962745Z 2025-05-07T20:32:51.7962977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7963405Z self=, 2025-05-07T20:32:51.7963816Z T=128, 2025-05-07T20:32:51.7964021Z D=7168, 2025-05-07T20:32:51.7964219Z scale_ub=None, 2025-05-07T20:32:51.7964450Z contiguous=False, 2025-05-07T20:32:51.7964690Z compiled=False, 2025-05-07T20:32:51.7964904Z ) 2025-05-07T20:32:52.1552932Z self = 2025-05-07T20:32:52.1553461Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.1553763Z 2025-05-07T20:32:52.1553863Z @given( 2025-05-07T20:32:52.1554229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1554622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1555033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1555375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1555947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1556245Z ) 2025-05-07T20:32:52.1556609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1557059Z def test_silu_mul_quant( 2025-05-07T20:32:52.1557310Z self, 2025-05-07T20:32:52.1557634Z T: int, 2025-05-07T20:32:52.1557835Z D: int, 2025-05-07T20:32:52.1558062Z scale_ub: Optional[float], 2025-05-07T20:32:52.1558346Z contiguous: bool, 2025-05-07T20:32:52.1558587Z compiled: bool, 2025-05-07T20:32:52.1558926Z ) -> None: 2025-05-07T20:32:52.1559145Z torch.manual_seed(2025) 2025-05-07T20:32:52.1559397Z 2025-05-07T20:32:52.1559679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1560022Z 2025-05-07T20:32:52.1560223Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1560524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1560834Z x = x_sign * x_clamp 2025-05-07T20:32:52.1561084Z x0 = x[:, :D] 2025-05-07T20:32:52.1561314Z x1 = x[:, D:] 2025-05-07T20:32:52.1561521Z 2025-05-07T20:32:52.1561786Z if contiguous: 2025-05-07T20:32:52.1562029Z x0 = x0.contiguous() 2025-05-07T20:32:52.1562291Z x1 = x1.contiguous() 2025-05-07T20:32:52.1562544Z 2025-05-07T20:32:52.1562782Z if scale_ub is not None: 2025-05-07T20:32:52.1563075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1563422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1563745Z ) 2025-05-07T20:32:52.1563939Z else: 2025-05-07T20:32:52.1564159Z scale_ub_tensor = None 2025-05-07T20:32:52.1564420Z 2025-05-07T20:32:52.1564665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1564980Z op = silu_mul_quant 2025-05-07T20:32:52.1565238Z if compiled: 
2025-05-07T20:32:52.1565493Z op = torch.compile(op) 2025-05-07T20:32:52.1565797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1566081Z 2025-05-07T20:32:52.1566281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1566455Z 2025-05-07T20:32:52.1566557Z moe/activation_test.py:117: 2025-05-07T20:32:52.1566864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1567197Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1567480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1568191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1568896Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1569443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1570131Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1570877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1571429Z kernel = self.compile( 2025-05-07T20:32:52.1571983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1572647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1573051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1573280Z 2025-05-07T20:32:52.1573500Z self = 2025-05-07T20:32:52.1574601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1576010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dba2e830>} 2025-05-07T20:32:52.1577422Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1578595Z context = 2025-05-07T20:32:52.1578888Z 2025-05-07T20:32:52.1579235Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1579815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1580299Z module_map=module_map) 2025-05-07T20:32:52.1580670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1581034Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1581296Z E ^ 2025-05-07T20:32:52.1581776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1582234Z 2025-05-07T20:32:52.1582663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1583229Z 2025-05-07T20:32:52.1583341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1583770Z self=, 2025-05-07T20:32:52.1584182Z T=4096, 2025-05-07T20:32:52.1584380Z D=5120, 2025-05-07T20:32:52.1584572Z scale_ub=1200.0, 2025-05-07T20:32:52.1584805Z contiguous=True, 2025-05-07T20:32:52.1585035Z compiled=False, 2025-05-07T20:32:52.1585240Z ) 2025-05-07T20:32:52.1585578Z self = 2025-05-07T20:32:52.1586083Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.1586361Z 2025-05-07T20:32:52.1586441Z @given( 2025-05-07T20:32:52.1586684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1587010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1587323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1587666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1588005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1588301Z ) 2025-05-07T20:32:52.1588652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1589107Z def test_silu_mul_quant( 2025-05-07T20:32:52.1589359Z self, 2025-05-07T20:32:52.1589554Z T: int, 2025-05-07T20:32:52.1589760Z D: int, 2025-05-07T20:32:52.1589986Z scale_ub: Optional[float], 2025-05-07T20:32:52.1590260Z contiguous: bool, 2025-05-07T20:32:52.1590507Z compiled: bool, 2025-05-07T20:32:52.1590735Z ) -> None: 2025-05-07T20:32:52.1590955Z torch.manual_seed(2025) 2025-05-07T20:32:52.1591257Z 2025-05-07T20:32:52.1591545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1591889Z 2025-05-07T20:32:52.1592094Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1592395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1592749Z x = x_sign * x_clamp 2025-05-07T20:32:52.1593003Z x0 = x[:, :D] 2025-05-07T20:32:52.1593222Z x1 = x[:, D:] 2025-05-07T20:32:52.1593434Z 2025-05-07T20:32:52.1593619Z if contiguous: 2025-05-07T20:32:52.1593859Z x0 = x0.contiguous() 2025-05-07T20:32:52.1594121Z x1 = x1.contiguous() 2025-05-07T20:32:52.1594362Z 2025-05-07T20:32:52.1594564Z if scale_ub is not None: 2025-05-07T20:32:52.1594843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1595178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1595490Z ) 2025-05-07T20:32:52.1595691Z else: 2025-05-07T20:32:52.1595905Z scale_ub_tensor = None 2025-05-07T20:32:52.1596160Z 2025-05-07T20:32:52.1596399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1596716Z op = silu_mul_quant 2025-05-07T20:32:52.1596967Z if compiled: 2025-05-07T20:32:52.1597268Z op = torch.compile(op) 2025-05-07T20:32:52.1597568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1597847Z 2025-05-07T20:32:52.1598048Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1598259Z 2025-05-07T20:32:52.1598369Z moe/activation_test.py:117: 2025-05-07T20:32:52.1598660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1598992Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1599280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1599979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1600689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1601235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1601975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1602647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1603242Z kernel = self.compile( 2025-05-07T20:32:52.1603797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1604459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1604858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1605094Z 2025-05-07T20:32:52.1605307Z self = 2025-05-07T20:32:52.1606407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1607804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6dba2db40>} 2025-05-07T20:32:52.1609171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1610216Z context = 2025-05-07T20:32:52.1610507Z 2025-05-07T20:32:52.1610681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1611254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1611726Z module_map=module_map) 2025-05-07T20:32:52.1612096Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1612459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1612717Z E ^ 2025-05-07T20:32:52.1613192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1613649Z 2025-05-07T20:32:52.1614075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1614594Z 2025-05-07T20:32:52.1614706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1615121Z self=, 2025-05-07T20:32:52.1615527Z T=1, 2025-05-07T20:32:52.1615719Z D=5120, 2025-05-07T20:32:52.1615910Z scale_ub=None, 2025-05-07T20:32:52.1616129Z contiguous=True, 2025-05-07T20:32:52.1616358Z compiled=True, 2025-05-07T20:32:52.1616561Z ) 2025-05-07T20:32:52.7392014Z self = 2025-05-07T20:32:52.7392662Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.7392974Z 2025-05-07T20:32:52.7393208Z @given( 2025-05-07T20:32:52.7393450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7400489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7400821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7401272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7401782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7402072Z ) 2025-05-07T20:32:52.7402433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7402873Z def test_silu_mul_quant( 2025-05-07T20:32:52.7403159Z self, 2025-05-07T20:32:52.7403376Z T: int, 2025-05-07T20:32:52.7403580Z D: int, 2025-05-07T20:32:52.7403803Z scale_ub: Optional[float], 2025-05-07T20:32:52.7404077Z contiguous: bool, 2025-05-07T20:32:52.7404390Z compiled: bool, 2025-05-07T20:32:52.7404625Z ) -> None: 2025-05-07T20:32:52.7404854Z torch.manual_seed(2025) 2025-05-07T20:32:52.7405094Z 2025-05-07T20:32:52.7405381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7405733Z 2025-05-07T20:32:52.7405926Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7406229Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7406544Z x = x_sign * x_clamp 2025-05-07T20:32:52.7406782Z x0 = x[:, :D] 2025-05-07T20:32:52.7407008Z x1 = x[:, D:] 2025-05-07T20:32:52.7407220Z 2025-05-07T20:32:52.7407412Z if contiguous: 2025-05-07T20:32:52.7407643Z x0 = x0.contiguous() 2025-05-07T20:32:52.7407906Z x1 = x1.contiguous() 2025-05-07T20:32:52.7408152Z 2025-05-07T20:32:52.7408348Z if scale_ub is not None: 2025-05-07T20:32:52.7408634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7408979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7409287Z ) 2025-05-07T20:32:52.7409492Z else: 2025-05-07T20:32:52.7409708Z scale_ub_tensor = None 2025-05-07T20:32:52.7409956Z 2025-05-07T20:32:52.7410199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7410522Z op = silu_mul_quant 2025-05-07T20:32:52.7410775Z if compiled: 2025-05-07T20:32:52.7411030Z op = torch.compile(op) 2025-05-07T20:32:52.7411333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7411606Z 2025-05-07T20:32:52.7411806Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.7412097Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.7412391Z 2025-05-07T20:32:52.7412705Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7413046Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.7413350Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.7413670Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.7414041Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.7414352Z 2025-05-07T20:32:52.7414553Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.7414757Z 2025-05-07T20:32:52.7414859Z moe/activation_test.py:126: 2025-05-07T20:32:52.7415159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7415494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.7415822Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.7416628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.7417397Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.7417940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7418746Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7419498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.7420226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.7421027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:52.7421781Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.7422509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.7423157Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.7423768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.7424330Z fn() 2025-05-07T20:32:52.7424841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.7425426Z self.fn.run( 2025-05-07T20:32:52.7425903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7426442Z kernel = self.compile( 2025-05-07T20:32:52.7426994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7427662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7428062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7428287Z 2025-05-07T20:32:52.7428499Z self = 2025-05-07T20:32:52.7429606Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7431019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6dba2f250>} 2025-05-07T20:32:52.7432389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7433432Z context = 2025-05-07T20:32:52.7433722Z 2025-05-07T20:32:52.7433891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7434469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7434946Z module_map=module_map) 2025-05-07T20:32:52.7435313Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7435679Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.7435948Z E ^ 2025-05-07T20:32:52.7436420Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7436879Z 2025-05-07T20:32:52.7437301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7437826Z 2025-05-07T20:32:52.7437933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.7438354Z self=, 2025-05-07T20:32:52.7438763Z T=2048, 2025-05-07T20:32:52.7438949Z D=5120, 2025-05-07T20:32:52.7439152Z scale_ub=None, 2025-05-07T20:32:52.7439371Z contiguous=True, 2025-05-07T20:32:52.7439597Z compiled=True, 2025-05-07T20:32:52.7439812Z ) 2025-05-07T20:32:53.2806458Z self = 2025-05-07T20:32:53.2807148Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.2807434Z 2025-05-07T20:32:53.2807520Z @given( 2025-05-07T20:32:53.2807766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2808147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2808473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2808823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2809163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2809461Z ) 2025-05-07T20:32:53.2809832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2810296Z def test_silu_mul_quant( 2025-05-07T20:32:53.2810547Z self, 2025-05-07T20:32:53.2810758Z T: int, 2025-05-07T20:32:53.2810971Z D: int, 2025-05-07T20:32:53.2811265Z scale_ub: Optional[float], 2025-05-07T20:32:53.2811554Z contiguous: bool, 2025-05-07T20:32:53.2811811Z compiled: bool, 2025-05-07T20:32:53.2812044Z ) -> None: 2025-05-07T20:32:53.2812279Z torch.manual_seed(2025) 2025-05-07T20:32:53.2812536Z 2025-05-07T20:32:53.2812818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2813179Z 2025-05-07T20:32:53.2813386Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2813687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2814013Z x = x_sign * x_clamp 2025-05-07T20:32:53.2814265Z x0 = x[:, :D] 2025-05-07T20:32:53.2814489Z x1 = x[:, D:] 2025-05-07T20:32:53.2814708Z 2025-05-07T20:32:53.2814908Z if contiguous: 2025-05-07T20:32:53.2815152Z x0 = x0.contiguous() 2025-05-07T20:32:53.2815426Z x1 = x1.contiguous() 2025-05-07T20:32:53.2815680Z 2025-05-07T20:32:53.2815881Z if scale_ub is not None: 2025-05-07T20:32:53.2816175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.2816530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.2816853Z ) 2025-05-07T20:32:53.2817054Z else: 2025-05-07T20:32:53.2817277Z scale_ub_tensor = None 2025-05-07T20:32:53.2817544Z 2025-05-07T20:32:53.2817788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.2818214Z op = silu_mul_quant 2025-05-07T20:32:53.2818477Z if compiled: 
2025-05-07T20:32:53.2818734Z op = torch.compile(op) 2025-05-07T20:32:53.2819052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.2819342Z 2025-05-07T20:32:53.2819544Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.2819917Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.2820222Z 2025-05-07T20:32:53.2820471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.2820826Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.2821137Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.2821468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.2821839Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.2822167Z 2025-05-07T20:32:53.2822391Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.2822594Z 2025-05-07T20:32:53.2822701Z moe/activation_test.py:126: 2025-05-07T20:32:53.2823015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.2823363Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.2823703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.2824520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.2825298Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.2825866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.2826607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.2827315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.2828098Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.2828865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:53.2829625Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.2830374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.2831029Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.2831737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.2832263Z fn() 2025-05-07T20:32:53.2832783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.2833378Z self.fn.run( 2025-05-07T20:32:53.2833858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.2834402Z kernel = self.compile( 2025-05-07T20:32:53.2834958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.2835630Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.2836032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.2836267Z 2025-05-07T20:32:53.2836482Z self = 2025-05-07T20:32:53.2837599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:53.2839005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db5d3760>} 2025-05-07T20:32:53.2840377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.2841429Z context = 2025-05-07T20:32:53.2841837Z 2025-05-07T20:32:53.2842012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.2842548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.2843075Z module_map=module_map) 2025-05-07T20:32:53.2843458Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.2843827Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.2844101Z E ^ 2025-05-07T20:32:53.2844586Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.2845050Z 2025-05-07T20:32:53.2845473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.2845992Z 2025-05-07T20:32:53.2846108Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2846531Z self=, 2025-05-07T20:32:53.2846946Z T=128, 2025-05-07T20:32:53.2847144Z D=5120, 2025-05-07T20:32:53.2847348Z scale_ub=None, 2025-05-07T20:32:53.2847574Z contiguous=True, 2025-05-07T20:32:53.2847808Z compiled=True, 2025-05-07T20:32:53.2848022Z ) 2025-05-07T20:32:54.1735090Z self = 2025-05-07T20:32:54.1735894Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1736277Z 2025-05-07T20:32:54.1736489Z @given( 2025-05-07T20:32:54.1736730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1737054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1737373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1737711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1738139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1738436Z ) 2025-05-07T20:32:54.1738802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1739252Z def test_silu_mul_quant( 2025-05-07T20:32:54.1739503Z self, 2025-05-07T20:32:54.1739790Z T: int, 2025-05-07T20:32:54.1739990Z D: int, 2025-05-07T20:32:54.1740222Z scale_ub: Optional[float], 2025-05-07T20:32:54.1740502Z contiguous: bool, 2025-05-07T20:32:54.1740745Z compiled: bool, 2025-05-07T20:32:54.1740978Z ) -> None: 2025-05-07T20:32:54.1741202Z torch.manual_seed(2025) 2025-05-07T20:32:54.1741449Z 2025-05-07T20:32:54.1741731Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1742087Z 2025-05-07T20:32:54.1742283Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1742585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1742906Z x = x_sign * x_clamp 2025-05-07T20:32:54.1743148Z x0 = x[:, :D] 2025-05-07T20:32:54.1743372Z x1 = x[:, D:] 2025-05-07T20:32:54.1743592Z 2025-05-07T20:32:54.1743781Z if contiguous: 2025-05-07T20:32:54.1744018Z x0 = x0.contiguous() 2025-05-07T20:32:54.1744288Z x1 = x1.contiguous() 2025-05-07T20:32:54.1744535Z 2025-05-07T20:32:54.1744730Z if scale_ub is not None: 2025-05-07T20:32:54.1745015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1745362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1745669Z ) 2025-05-07T20:32:54.1745876Z else: 2025-05-07T20:32:54.1746095Z scale_ub_tensor = None 2025-05-07T20:32:54.1746348Z 2025-05-07T20:32:54.1746591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
[The following Hypothesis examples rerun the identical test body and raise the identical CompilationError; only the sampled parameters and the failure site vary, so the repeated tracebacks are elided.]
2025-05-07T20:32:53.2846108Z Trying example: test_silu_mul_quant( self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- fails in ref_fn() at moe/activation_test.py:126, same CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:54.1774856Z Trying example: test_silu_mul_quant( self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- fails in ref_fn() at moe/activation_test.py:126, same CompilationError in _kernel_quantize_fp8_row
2025-05-07T20:32:54.9234482Z Trying example: test_silu_mul_quant( self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True ) -- torch._dynamo hits its recompile limit (warnings below), then fails in ref_fn() with the same CompilationError
2025-05-07T20:32:54.9624028Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:54.9626802Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:54.9629307Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:54.9631151Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:54.9633431Z W0507 20:32:54.960000 88308 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
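Separate from the fp8 failure, the warning above shows torch.compile giving up on silu_mul_quant after 8 recompiles: each sampled (T, D, contiguous) combination changes the guarded shapes and strides of x0 (the "stride mismatch ... expected 5120, actual 10240" is a contiguous copy versus a strided view of the [T, 2*D] buffer), so by the eighth variant dynamo falls back to eager. Besides rerunning with TORCH_LOGS="recompiles" as the warning suggests, the knobs it points at look roughly like the sketch below; the values are illustrative, not what the test suite actually sets.

    # Sketch of the mitigations suggested by the recompile warning above.
    import torch

    # Raise the per-function recompile budget so all sampled (T, D, contiguous)
    # variants can compile (the warning reports the default limit of 8):
    torch._dynamo.config.recompile_limit = 32

    # Or mark the token dimension as dynamic up front, so a change in T does
    # not install a fresh guard set on every Hypothesis example:
    x = torch.randn(128, 2 * 5120, device="cuda", dtype=torch.bfloat16)
    torch._dynamo.mark_dynamic(x, 0)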
[... test body and traceback identical to the first example above; ref_fn() raises the same CompilationError in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:55.0697741Z Trying example: test_silu_mul_quant( self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True )
[... identical test body omitted; this time the failure is inside fn() itself, i.e. in the fused _fbgemm_silu_mul_quant kernel rather than in the eager reference ...]
2025-05-07T20:32:55.2157285Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.2157452Z 
2025-05-07T20:32:55.2157556Z moe/activation_test.py:117: 
2025-05-07T20:32:55.2157848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.2158269Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.2158558Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.2159124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:55.2159692Z     return fn(*args, **kwargs)
2025-05-07T20:32:55.2160366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.2161073Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.2161687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:55.2162389Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:55.2163066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:55.2163613Z     kernel = self.compile(
2025-05-07T20:32:55.2164162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:55.2164833Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:55.2165235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:55.2165462Z 
2025-05-07T20:32:55.2165683Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:55.2166786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:55.2168195Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6da692b90>}
2025-05-07T20:32:55.2169573Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:55.2170620Z context = <...>
2025-05-07T20:32:55.2170912Z 
2025-05-07T20:32:55.2171086Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:55.2171681Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:55.2172161Z                            module_map=module_map)
2025-05-07T20:32:55.2172534Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.2172892Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.2173156Z E       ^
2025-05-07T20:32:55.2173678Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.2174139Z 
2025-05-07T20:32:55.2174567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
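For reference, both failing kernels attempt the same operation the test checks: y = x0 * sigmoid(x0) * x1 followed by row-wise fp8 quantization. A pure-PyTorch sketch of that quantization step, written to match the dequantization the test performs (y_fp8.to(torch.float32) * y_scale[:, None]), is below; the per-row scale formula and the clamping are assumptions about triton_quantize_fp8_row's semantics, not its actual implementation.

    # Pure-PyTorch sketch of row-wise fp8 quantization as exercised by
    # ref_fn(); the scale formula is an assumption, not FBGEMM's kernel.
    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for fp8e4nv/e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Clamp the per-row max so outliers cannot blow up the scale.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max.clamp(min=1e-12) / FP8_MAX  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

On this runner even the sketch's final .to(torch.float8_e4m3fn) line only works because the conversion happens in PyTorch rather than in a Triton kernel; the hardware limitation applies to Triton's fp8e4nv codegen, not to storing the dtype.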
y_scale_ref = ref_fn() 2025-05-07T20:32:55.2869826Z 2025-05-07T20:32:55.2869930Z moe/activation_test.py:126: 2025-05-07T20:32:55.2870231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.2870571Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.2870904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.2871709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.2872482Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.2873040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.2873854Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.2874559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.2875357Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.2876123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:55.2876892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.2877642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.2878301Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.2878960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.2879495Z fn() 2025-05-07T20:32:55.2880019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.2880608Z self.fn.run( 2025-05-07T20:32:55.2881094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.2881638Z kernel = self.compile( 2025-05-07T20:32:55.2882195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.2882859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.2883265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.2883494Z 2025-05-07T20:32:55.2883715Z self = 2025-05-07T20:32:55.2884883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.2886286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6daf79b40>} 2025-05-07T20:32:55.2887666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.2888718Z context = 2025-05-07T20:32:55.2889008Z 2025-05-07T20:32:55.2889228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.2889760Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.2890251Z module_map=module_map) 2025-05-07T20:32:55.2890628Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.2890994Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.2891263Z E ^ 2025-05-07T20:32:55.2891740Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.2892204Z 2025-05-07T20:32:55.2892637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.2893160Z 2025-05-07T20:32:55.2893267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.2893692Z self=, 2025-05-07T20:32:55.2894106Z T=1, 2025-05-07T20:32:55.2894298Z D=5120, 2025-05-07T20:32:55.2894491Z scale_ub=None, 2025-05-07T20:32:55.2894713Z contiguous=True, 2025-05-07T20:32:55.2894945Z compiled=False, 2025-05-07T20:32:55.2895153Z ) 2025-05-07T20:32:55.6102863Z self = 2025-05-07T20:32:55.6103623Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:55.6104038Z 2025-05-07T20:32:55.6104151Z @given( 2025-05-07T20:32:55.6104502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6104820Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6105140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6105486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6105818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6106111Z ) 2025-05-07T20:32:55.6106471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.6106923Z def test_silu_mul_quant( 2025-05-07T20:32:55.6107169Z self, 2025-05-07T20:32:55.6107443Z T: int, 2025-05-07T20:32:55.6107646Z D: int, 2025-05-07T20:32:55.6107869Z scale_ub: Optional[float], 2025-05-07T20:32:55.6108155Z contiguous: bool, 2025-05-07T20:32:55.6108402Z compiled: bool, 2025-05-07T20:32:55.6108631Z ) -> None: 2025-05-07T20:32:55.6108856Z torch.manual_seed(2025) 2025-05-07T20:32:55.6109108Z 2025-05-07T20:32:55.6109386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.6109738Z 2025-05-07T20:32:55.6109940Z x_sign = torch.sign(x) 2025-05-07T20:32:55.6110233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.6110549Z x = x_sign * x_clamp 2025-05-07T20:32:55.6110800Z x0 = x[:, :D] 2025-05-07T20:32:55.6111018Z x1 = x[:, D:] 2025-05-07T20:32:55.6111233Z 2025-05-07T20:32:55.6111428Z if contiguous: 2025-05-07T20:32:55.6111670Z x0 = x0.contiguous() 2025-05-07T20:32:55.6111929Z x1 = x1.contiguous() 2025-05-07T20:32:55.6112180Z 2025-05-07T20:32:55.6112382Z if scale_ub is not None: 2025-05-07T20:32:55.6112658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.6113007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.6113320Z ) 2025-05-07T20:32:55.6113515Z else: 2025-05-07T20:32:55.6113744Z scale_ub_tensor = None 2025-05-07T20:32:55.6114001Z 2025-05-07T20:32:55.6114236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.6114557Z op = silu_mul_quant 2025-05-07T20:32:55.6114815Z if compiled: 2025-05-07T20:32:55.6115066Z 
op = torch.compile(op) 2025-05-07T20:32:55.6115372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6115650Z 2025-05-07T20:32:55.6115918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.6116093Z 2025-05-07T20:32:55.6116195Z moe/activation_test.py:117: 2025-05-07T20:32:55.6116493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6116830Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.6117119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6117827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.6118531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.6119072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.6126001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.6126730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.6127277Z kernel = self.compile( 2025-05-07T20:32:55.6127832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.6128501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.6128976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6129204Z 2025-05-07T20:32:55.6129415Z self = 2025-05-07T20:32:55.6130550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.6131946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daf79ea0>} 2025-05-07T20:32:55.6133360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.6134439Z context = 2025-05-07T20:32:55.6134731Z 2025-05-07T20:32:55.6134898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.6135421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.6135895Z module_map=module_map) 2025-05-07T20:32:55.6136261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.6136623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.6136878Z E ^ 2025-05-07T20:32:55.6137350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6137807Z 2025-05-07T20:32:55.6138341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.6138864Z 2025-05-07T20:32:55.6138970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.6139388Z self=, 2025-05-07T20:32:55.6139786Z T=128, 2025-05-07T20:32:55.6139980Z D=5120, 2025-05-07T20:32:55.6140175Z scale_ub=None, 2025-05-07T20:32:55.6140397Z contiguous=False, 2025-05-07T20:32:55.6140625Z compiled=True, 2025-05-07T20:32:55.6140828Z ) 2025-05-07T20:32:55.6141151Z self = 2025-05-07T20:32:55.6141644Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.6141917Z 2025-05-07T20:32:55.6141994Z @given( 2025-05-07T20:32:55.6142225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6142584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6142893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6143222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6143551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6143837Z ) 2025-05-07T20:32:55.6144197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.6144642Z def test_silu_mul_quant( 2025-05-07T20:32:55.6144879Z self, 2025-05-07T20:32:55.6145073Z T: int, 2025-05-07T20:32:55.6145269Z D: int, 2025-05-07T20:32:55.6145483Z scale_ub: Optional[float], 2025-05-07T20:32:55.6145757Z contiguous: bool, 2025-05-07T20:32:55.6145997Z compiled: bool, 2025-05-07T20:32:55.6146219Z ) -> None: 2025-05-07T20:32:55.6146437Z torch.manual_seed(2025) 2025-05-07T20:32:55.6146678Z 2025-05-07T20:32:55.6146950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.6147294Z 2025-05-07T20:32:55.6147490Z x_sign = torch.sign(x) 2025-05-07T20:32:55.6147777Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.6148089Z x = x_sign * x_clamp 2025-05-07T20:32:55.6148326Z x0 = x[:, :D] 2025-05-07T20:32:55.6148587Z x1 = x[:, D:] 2025-05-07T20:32:55.6148796Z 2025-05-07T20:32:55.6148986Z if contiguous: 2025-05-07T20:32:55.6149217Z x0 = x0.contiguous() 2025-05-07T20:32:55.6149468Z x1 = x1.contiguous() 2025-05-07T20:32:55.6149752Z 2025-05-07T20:32:55.6149937Z if scale_ub is not None: 2025-05-07T20:32:55.6150211Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.6150548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.6150849Z ) 2025-05-07T20:32:55.6151040Z else: 2025-05-07T20:32:55.6151248Z scale_ub_tensor = None 2025-05-07T20:32:55.6151491Z 2025-05-07T20:32:55.6151725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.6152039Z op = silu_mul_quant 2025-05-07T20:32:55.6152282Z if compiled: 2025-05-07T20:32:55.6152575Z op = torch.compile(op) 2025-05-07T20:32:55.6152873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6153144Z 2025-05-07T20:32:55.6153334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.6153497Z 2025-05-07T20:32:55.6153598Z moe/activation_test.py:117: 2025-05-07T20:32:55.6153891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6154216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.6154500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.6155061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:55.6155862Z return fn(*args, **kwargs) 
2025-05-07T20:32:55.6156530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.6157225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.6157763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.6158451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.6159117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.6159655Z kernel = self.compile( 2025-05-07T20:32:55.6160196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.6160857Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.6161251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.6161475Z 2025-05-07T20:32:55.6161770Z self = 2025-05-07T20:32:55.6162860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.6164254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daf78dc0>} 2025-05-07T20:32:55.6165616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.6166655Z context = 2025-05-07T20:32:55.6166941Z 2025-05-07T20:32:55.6167111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.6167631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.6168101Z module_map=module_map) 2025-05-07T20:32:55.6168467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.6168879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.6169140Z E ^ 2025-05-07T20:32:55.6169603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6170056Z 2025-05-07T20:32:55.6170547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.6171060Z 2025-05-07T20:32:55.6171163Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.6171577Z self=, 2025-05-07T20:32:55.6171978Z T=128, 2025-05-07T20:32:55.6172162Z D=7168, 2025-05-07T20:32:55.6172353Z scale_ub=1200.0, 2025-05-07T20:32:55.6172579Z contiguous=False, 2025-05-07T20:32:55.6172797Z compiled=False, 2025-05-07T20:32:55.6172996Z ) 2025-05-07T20:32:55.7425119Z self = 2025-05-07T20:32:55.7426006Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:55.7426411Z 2025-05-07T20:32:55.7426518Z @given( 2025-05-07T20:32:55.7426844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7427284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7427645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7427985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7428321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7428608Z ) 2025-05-07T20:32:55.7428964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7429414Z def test_silu_mul_quant( 2025-05-07T20:32:55.7429662Z self, 2025-05-07T20:32:55.7429857Z T: int, 2025-05-07T20:32:55.7430057Z D: int, 2025-05-07T20:32:55.7430282Z scale_ub: Optional[float], 2025-05-07T20:32:55.7430591Z contiguous: bool, 2025-05-07T20:32:55.7430836Z compiled: bool, 2025-05-07T20:32:55.7431064Z ) -> None: 2025-05-07T20:32:55.7431290Z torch.manual_seed(2025) 2025-05-07T20:32:55.7431538Z 2025-05-07T20:32:55.7431812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7432163Z 2025-05-07T20:32:55.7432365Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7432658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7432974Z x = x_sign * x_clamp 2025-05-07T20:32:55.7433222Z x0 = x[:, :D] 2025-05-07T20:32:55.7433462Z x1 = x[:, D:] 2025-05-07T20:32:55.7433707Z 2025-05-07T20:32:55.7433900Z if contiguous: 2025-05-07T20:32:55.7434137Z x0 = x0.contiguous() 2025-05-07T20:32:55.7434481Z x1 = x1.contiguous() 2025-05-07T20:32:55.7434732Z 2025-05-07T20:32:55.7434926Z if scale_ub is not None: 2025-05-07T20:32:55.7435207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7435551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7435863Z ) 2025-05-07T20:32:55.7436061Z else: 2025-05-07T20:32:55.7436275Z scale_ub_tensor = None 2025-05-07T20:32:55.7436530Z 2025-05-07T20:32:55.7436763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7437090Z op = silu_mul_quant 2025-05-07T20:32:55.7437350Z if compiled: 2025-05-07T20:32:55.7437597Z op = torch.compile(op) 2025-05-07T20:32:55.7437900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7438177Z 2025-05-07T20:32:55.7438371Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7438547Z 2025-05-07T20:32:55.7438649Z moe/activation_test.py:117: 2025-05-07T20:32:55.7438953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7439280Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7439567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7440340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7441041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7441585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7442336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7443010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7443550Z kernel = self.compile( 2025-05-07T20:32:55.7444151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7444821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7445221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7445521Z 2025-05-07T20:32:55.7445732Z self = 2025-05-07T20:32:55.7446834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7448234Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6daba3f40>} 2025-05-07T20:32:55.7449601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7450647Z context = 2025-05-07T20:32:55.7450940Z 2025-05-07T20:32:55.7451107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7451642Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7452117Z module_map=module_map) 2025-05-07T20:32:55.7452488Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7452844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7453108Z E ^ 2025-05-07T20:32:55.7453581Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7454037Z 2025-05-07T20:32:55.7454459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7455030Z 2025-05-07T20:32:55.7455138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7455738Z self=, 2025-05-07T20:32:55.7456155Z T=128, 2025-05-07T20:32:55.7456346Z D=5120, 2025-05-07T20:32:55.7456550Z scale_ub=None, 2025-05-07T20:32:55.7456771Z contiguous=False, 2025-05-07T20:32:55.7457002Z compiled=False, 2025-05-07T20:32:55.7457218Z ) 2025-05-07T20:32:55.7457542Z self = 2025-05-07T20:32:55.7458133Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.7458406Z 2025-05-07T20:32:55.7458484Z @given( 2025-05-07T20:32:55.7458716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7459038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7459346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7459684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7460021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7460306Z ) 2025-05-07T20:32:55.7460663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7461183Z def test_silu_mul_quant( 2025-05-07T20:32:55.7461436Z self, 2025-05-07T20:32:55.7461635Z T: int, 2025-05-07T20:32:55.7461835Z D: int, 2025-05-07T20:32:55.7462057Z scale_ub: Optional[float], 2025-05-07T20:32:55.7462392Z contiguous: bool, 2025-05-07T20:32:55.7462637Z compiled: bool, 2025-05-07T20:32:55.7462862Z ) -> None: 2025-05-07T20:32:55.7463077Z torch.manual_seed(2025) 2025-05-07T20:32:55.7463319Z 2025-05-07T20:32:55.7463599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7463939Z 2025-05-07T20:32:55.7464138Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7464435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7464744Z x = x_sign * x_clamp 2025-05-07T20:32:55.7464986Z x0 = x[:, :D] 2025-05-07T20:32:55.7465205Z x1 = x[:, D:] 2025-05-07T20:32:55.7465482Z 2025-05-07T20:32:55.7465676Z if contiguous: 2025-05-07T20:32:55.7465916Z x0 = x0.contiguous() 2025-05-07T20:32:55.7466177Z x1 = x1.contiguous() 2025-05-07T20:32:55.7466417Z 2025-05-07T20:32:55.7466611Z if scale_ub is not None: 2025-05-07T20:32:55.7466888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7467229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7467541Z ) 2025-05-07T20:32:55.7467739Z else: 2025-05-07T20:32:55.7467948Z scale_ub_tensor = None 2025-05-07T20:32:55.7468200Z 2025-05-07T20:32:55.7468439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7468750Z op = silu_mul_quant 2025-05-07T20:32:55.7469005Z if compiled: 2025-05-07T20:32:55.7469255Z op = torch.compile(op) 2025-05-07T20:32:55.7469554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7469832Z 2025-05-07T20:32:55.7470032Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7470200Z 2025-05-07T20:32:55.7470302Z moe/activation_test.py:117: 2025-05-07T20:32:55.7470602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7470937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7471231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7471927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7472626Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:55.7473175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:55.7473927Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:55.7474602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:55.7475143Z     kernel = self.compile(
2025-05-07T20:32:55.7475692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:55.7476351Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:55.7476747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.7477188Z self = <...>
2025-05-07T20:32:55.7478274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:55.7479662Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fd6daba12d0>}
2025-05-07T20:32:55.7481072Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:55.7482112Z context = <...>
2025-05-07T20:32:55.7482613Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:55.7483136Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:55.7483609Z                            module_map=module_map)
2025-05-07T20:32:55.7483977Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.7484332Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.7484592Z E   ^
2025-05-07T20:32:55.7485057Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.7485986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.7486610Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.7487023Z     self=<...>,
2025-05-07T20:32:55.7487428Z     T=128,
2025-05-07T20:32:55.7487619Z     D=5120,
2025-05-07T20:32:55.7487809Z     scale_ub=1200.0,
2025-05-07T20:32:55.7488035Z     contiguous=True,
2025-05-07T20:32:55.7488260Z     compiled=False,
2025-05-07T20:32:55.7488464Z )
2025-05-07T20:32:55.9417323Z self = <...>
2025-05-07T20:32:55.9418237Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:55.9418736Z     @given(
2025-05-07T20:32:55.9419057Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.9419474Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.9419791Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.9420135Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.9420476Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.9420763Z     )
2025-05-07T20:32:55.9421128Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.9421587Z     def test_silu_mul_quant(
2025-05-07T20:32:55.9421834Z         self,
2025-05-07T20:32:55.9422032Z         T: int,
2025-05-07T20:32:55.9422237Z         D: int,
2025-05-07T20:32:55.9422464Z         scale_ub: Optional[float],
2025-05-07T20:32:55.9422739Z         contiguous: bool,
2025-05-07T20:32:55.9422985Z         compiled: bool,
2025-05-07T20:32:55.9423216Z     ) -> None:
2025-05-07T20:32:55.9423552Z         torch.manual_seed(2025)
2025-05-07T20:32:55.9424138Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.9424690Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.9424996Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.9425312Z         x = x_sign * x_clamp
2025-05-07T20:32:55.9425556Z         x0 = x[:, :D]
2025-05-07T20:32:55.9425780Z         x1 = x[:, D:]
2025-05-07T20:32:55.9426186Z         if contiguous:
2025-05-07T20:32:55.9426427Z             x0 = x0.contiguous()
2025-05-07T20:32:55.9426685Z             x1 = x1.contiguous()
2025-05-07T20:32:55.9427137Z         if scale_ub is not None:
2025-05-07T20:32:55.9427417Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.9427755Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.9428076Z             )
2025-05-07T20:32:55.9428273Z         else:
2025-05-07T20:32:55.9428486Z             scale_ub_tensor = None
2025-05-07T20:32:55.9428986Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.9429299Z             op = silu_mul_quant
2025-05-07T20:32:55.9429624Z             if compiled:
2025-05-07T20:32:55.9429877Z                 op = torch.compile(op)
2025-05-07T20:32:55.9430179Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.9430717Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:55.9430989Z moe/activation_test.py:117:
2025-05-07T20:32:55.9431288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:55.9431624Z moe/activation_test.py:115: in fn
2025-05-07T20:32:55.9431908Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.9432610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:55.9433314Z     _fbgemm_silu_mul_quant[grid](
[traceback identical to the one above, ending in the same error:]
2025-05-07T20:32:55.9445896Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.9446778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.9447406Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test body and traceback identical to the example above; same CompilationError from _fbgemm_silu_mul_quant]
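Every example in this run fails for the same underlying reason: Triton refuses to emit the fp8e4nv (float8 e4m3) element type for this GPU. The job runs on a g5.4xlarge runner, i.e. an NVIDIA A10G with compute capability sm_86, and Triton only compiles fp8e4nv for sm_89 (Ada) and newer; on this card it offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports. A minimal sketch of a guard a test suite could use, with a hypothetical helper name (this is not FBGEMM's actual skip logic):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (e4m3) only for compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: skip the whole fp8 test class on older GPUs.
    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTests(unittest.TestCase):
        pass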
2025-05-07T20:32:55.9485990Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the first example above; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:32:56.0895565Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:56.3457296Z self = <...>
2025-05-07T20:32:56.3457940Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body identical to the first example above up to the call into fn(); here fn() returns and the reference path fails instead:]
2025-05-07T20:32:56.3470536Z         y_fp8, y_scale = fn()
2025-05-07T20:32:56.3470836Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:56.3471372Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.3471716Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:56.3472081Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:56.3472408Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:56.3472768Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.3473374Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.3473683Z moe/activation_test.py:126:
2025-05-07T20:32:56.3473982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.3474314Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.3474650Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.3475455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.3476217Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.3476844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.3477537Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.3478237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:56.3478969Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.3479728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:56.3480487Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.3481232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:56.3481879Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:56.3482497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:56.3483032Z     fn()
2025-05-07T20:32:56.3483593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:56.3484183Z     self.fn.run(
2025-05-07T20:32:56.3484660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.3485210Z     kernel = self.compile(
2025-05-07T20:32:56.3485756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.3486420Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.3486868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[locals as in the first traceback above, except num_stages=2 in CUDAOptions]
2025-05-07T20:32:56.3492701Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.3493234Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.3493814Z                            module_map=module_map)
2025-05-07T20:32:56.3494190Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.3494550Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.3494828Z E   ^
2025-05-07T20:32:56.3495342Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.3496223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.3496847Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the first example above; same CompilationError from _fbgemm_silu_mul_quant]
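The scale_ub=None example above is the one variant that fails differently: the op under test returned, and the error moved into the test's own reference path, because triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which also materializes fp8e4nv. So on this hardware even the reference computation cannot compile. A torch-only row-wise quantization along these lines could stand in as a reference; this is a sketch assuming torch.float8_e4m3fn is available (PyTorch 2.1+), not FBGEMM's triton_quantize_fp8_row:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row symmetric quantization to float8 e4m3: choose a scale so the
        # row's max |value| (optionally clamped to scale_ub) maps to fp8 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * scale[:, None], as the test does, recovers the input up to fp8 rounding.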
[each of the following examples runs the same test body and fails with the same CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:56.5213202Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:56.5250651Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.6281196Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:56.7603014Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
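Each "Trying example" line is Hypothesis walking the sampled_from grid under verbose settings; since every draw hits the same compile error, the search produces the run of repeats above. To replay one failing case deterministically instead of rerunning the whole search, the printed parameters can be pinned with hypothesis.example; a sketch (test body elided, self dropped for brevity):

    from hypothesis import example, given, settings, strategies as st

    @settings(deadline=None)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # same body as quoted above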
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.7622022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.7622710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.7623377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.7623956Z kernel = self.compile( 2025-05-07T20:32:56.7624513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.7625179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.7625579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7625804Z 2025-05-07T20:32:56.7626015Z self = 2025-05-07T20:32:56.7627106Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.7628527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5058157e0>} 2025-05-07T20:32:56.7629885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.7630924Z context = 2025-05-07T20:32:56.7631258Z 2025-05-07T20:32:56.7631432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.7631955Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.7632471Z module_map=module_map) 2025-05-07T20:32:56.7632842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.7633196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.7633457Z E ^ 2025-05-07T20:32:56.7633977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.7634430Z 2025-05-07T20:32:56.7634867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.7635384Z 2025-05-07T20:32:56.7635537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.7635958Z self=, 2025-05-07T20:32:56.7636361Z T=4096, 2025-05-07T20:32:56.7636550Z D=7168, 2025-05-07T20:32:56.7636744Z scale_ub=1200.0, 2025-05-07T20:32:56.7636975Z contiguous=False, 2025-05-07T20:32:56.7637204Z compiled=False, 2025-05-07T20:32:56.7637414Z ) 2025-05-07T20:32:56.7637738Z self = 2025-05-07T20:32:56.7638242Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:56.7638520Z 2025-05-07T20:32:56.7638599Z @given( 2025-05-07T20:32:56.7638833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.7639147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.7639459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.7639789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.7640122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.7640404Z ) 2025-05-07T20:32:56.7640760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.7641206Z def test_silu_mul_quant( 2025-05-07T20:32:56.7641447Z self, 2025-05-07T20:32:56.7641638Z T: int, 2025-05-07T20:32:56.7641836Z D: int, 2025-05-07T20:32:56.7642055Z scale_ub: Optional[float], 2025-05-07T20:32:56.7642325Z contiguous: bool, 2025-05-07T20:32:56.7642567Z compiled: bool, 2025-05-07T20:32:56.7642791Z ) -> None: 2025-05-07T20:32:56.7643003Z torch.manual_seed(2025) 2025-05-07T20:32:56.7643244Z 2025-05-07T20:32:56.7643518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.7643858Z 2025-05-07T20:32:56.7644122Z x_sign = torch.sign(x) 2025-05-07T20:32:56.7644440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.7644745Z x = x_sign * x_clamp 2025-05-07T20:32:56.7644985Z x0 = x[:, :D] 2025-05-07T20:32:56.7645204Z x1 = x[:, D:] 2025-05-07T20:32:56.7645412Z 2025-05-07T20:32:56.7645599Z if contiguous: 2025-05-07T20:32:56.7645830Z x0 = x0.contiguous() 2025-05-07T20:32:56.7646082Z x1 = x1.contiguous() 2025-05-07T20:32:56.7646327Z 2025-05-07T20:32:56.7646521Z if scale_ub is not None: 2025-05-07T20:32:56.7646793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.7647125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.7647432Z ) 2025-05-07T20:32:56.7647629Z else: 2025-05-07T20:32:56.7647836Z scale_ub_tensor = None 2025-05-07T20:32:56.7648087Z 2025-05-07T20:32:56.7648325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.7648633Z op = silu_mul_quant 2025-05-07T20:32:56.7648883Z if compiled: 2025-05-07T20:32:56.7649133Z op = torch.compile(op) 2025-05-07T20:32:56.7649433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.7649760Z 2025-05-07T20:32:56.7649957Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.7650121Z 2025-05-07T20:32:56.7650218Z moe/activation_test.py:117: 2025-05-07T20:32:56.7650512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7650883Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.7651166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.7651857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:56.7652559Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.7653106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.7653790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.7654505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.7655043Z kernel = self.compile( 2025-05-07T20:32:56.7655768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.7656438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.7656843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.7657068Z 2025-05-07T20:32:56.7657281Z self = 2025-05-07T20:32:56.7658418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.7659799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505816200>} 2025-05-07T20:32:56.7661162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.7662197Z context = 2025-05-07T20:32:56.7662486Z 2025-05-07T20:32:56.7662659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.7663179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.7663650Z module_map=module_map) 2025-05-07T20:32:56.7664139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.7664497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.7664750Z E ^ 2025-05-07T20:32:56.7665222Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:56.7666733Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:56.7667143Z     self=<...>,
2025-05-07T20:32:56.7667542Z     T=16384,
2025-05-07T20:32:56.7667738Z     D=7168,
2025-05-07T20:32:56.7667926Z     scale_ub=None,
2025-05-07T20:32:56.7668140Z     contiguous=True,
2025-05-07T20:32:56.7668362Z     compiled=True,
2025-05-07T20:32:56.7668562Z )
2025-05-07T20:32:56.9613822Z self = <...>
2025-05-07T20:32:56.9614423Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:56.9614797Z
2025-05-07T20:32:56.9614938Z     @given(
2025-05-07T20:32:56.9615262Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:56.9615836Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:56.9616216Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:56.9616564Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:56.9616977Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:56.9617271Z     )
2025-05-07T20:32:56.9617638Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:56.9618158Z     def test_silu_mul_quant(
2025-05-07T20:32:56.9618409Z         self,
2025-05-07T20:32:56.9618612Z         T: int,
2025-05-07T20:32:56.9618810Z         D: int,
2025-05-07T20:32:56.9619041Z         scale_ub: Optional[float],
2025-05-07T20:32:56.9619329Z         contiguous: bool,
2025-05-07T20:32:56.9619573Z         compiled: bool,
2025-05-07T20:32:56.9619809Z     ) -> None:
2025-05-07T20:32:56.9620036Z         torch.manual_seed(2025)
2025-05-07T20:32:56.9620359Z
2025-05-07T20:32:56.9620651Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:56.9621037Z
2025-05-07T20:32:56.9621243Z         x_sign = torch.sign(x)
2025-05-07T20:32:56.9621546Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:56.9621863Z         x = x_sign * x_clamp
2025-05-07T20:32:56.9622115Z         x0 = x[:, :D]
2025-05-07T20:32:56.9622339Z         x1 = x[:, D:]
2025-05-07T20:32:56.9622551Z
2025-05-07T20:32:56.9622745Z         if contiguous:
2025-05-07T20:32:56.9622989Z             x0 = x0.contiguous()
2025-05-07T20:32:56.9623257Z             x1 = x1.contiguous()
2025-05-07T20:32:56.9623504Z
2025-05-07T20:32:56.9623708Z         if scale_ub is not None:
2025-05-07T20:32:56.9623996Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:56.9624394Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:56.9624720Z             )
2025-05-07T20:32:56.9624926Z         else:
2025-05-07T20:32:56.9625141Z             scale_ub_tensor = None
2025-05-07T20:32:56.9625405Z
2025-05-07T20:32:56.9625656Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.9625976Z             op = silu_mul_quant
2025-05-07T20:32:56.9626238Z             if compiled:
2025-05-07T20:32:56.9626494Z                 op = torch.compile(op)
2025-05-07T20:32:56.9626798Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.9627091Z
2025-05-07T20:32:56.9627292Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:56.9627463Z
2025-05-07T20:32:56.9627573Z moe/activation_test.py:117:
2025-05-07T20:32:56.9627873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.9628213Z moe/activation_test.py:115: in fn
2025-05-07T20:32:56.9628573Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.9629154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:56.9629736Z     return fn(*args, **kwargs)
2025-05-07T20:32:56.9630425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:56.9631131Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:56.9631684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.9632388Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.9633071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.9633618Z     kernel = self.compile(
2025-05-07T20:32:56.9634179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.9634857Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.9635257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.9635495Z
2025-05-07T20:32:56.9635757Z self = <...>
2025-05-07T20:32:56.9636869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:56.9638314Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd505817760>}
2025-05-07T20:32:56.9639684Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:56.9640730Z context = <...>
2025-05-07T20:32:56.9641067Z
2025-05-07T20:32:56.9641237Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.9641774Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.9642253Z                            module_map=module_map)
2025-05-07T20:32:56.9642627Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.9642991Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:56.9643256Z E       ^
2025-05-07T20:32:56.9643725Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.9644187Z
2025-05-07T20:32:56.9644612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.9645141Z
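The same failure can be reproduced without Hypothesis by calling the op directly with any one of the sampled parameter combinations. A minimal sketch, mirroring the test body above (the silu_mul_quant import path is taken from the traceback; a CUDA device is assumed, and the sizes are one of the sampled pairs):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # one of the sampled (T, D) combinations
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On a GPU without fp8e4nv support this raises the CompilationError above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)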
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.9644187Z 2025-05-07T20:32:56.9644612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.9645141Z 2025-05-07T20:32:56.9645246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.9645671Z self=, 2025-05-07T20:32:56.9646075Z T=4096, 2025-05-07T20:32:56.9646268Z D=5120, 2025-05-07T20:32:56.9646468Z scale_ub=None, 2025-05-07T20:32:56.9646684Z contiguous=False, 2025-05-07T20:32:56.9646914Z compiled=True, 2025-05-07T20:32:56.9647120Z ) 2025-05-07T20:32:56.9647440Z self = 2025-05-07T20:32:56.9647946Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:56.9648220Z 2025-05-07T20:32:56.9648303Z @given( 2025-05-07T20:32:56.9648532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.9648850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.9649160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.9649544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.9649875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.9650164Z ) 2025-05-07T20:32:56.9650526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.9650973Z def test_silu_mul_quant( 2025-05-07T20:32:56.9651219Z self, 2025-05-07T20:32:56.9651417Z T: int, 2025-05-07T20:32:56.9651615Z D: int, 2025-05-07T20:32:56.9651841Z scale_ub: Optional[float], 2025-05-07T20:32:56.9652122Z contiguous: bool, 2025-05-07T20:32:56.9652360Z compiled: bool, 2025-05-07T20:32:56.9652585Z ) -> None: 2025-05-07T20:32:56.9652809Z torch.manual_seed(2025) 2025-05-07T20:32:56.9653052Z 2025-05-07T20:32:56.9653330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.9653677Z 2025-05-07T20:32:56.9653867Z x_sign = torch.sign(x) 2025-05-07T20:32:56.9654166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.9654478Z x = x_sign * x_clamp 2025-05-07T20:32:56.9654722Z x0 = x[:, :D] 2025-05-07T20:32:56.9654935Z x1 = x[:, D:] 2025-05-07T20:32:56.9655150Z 2025-05-07T20:32:56.9655338Z if contiguous: 2025-05-07T20:32:56.9656159Z x0 = x0.contiguous() 2025-05-07T20:32:56.9656492Z x1 = x1.contiguous() 2025-05-07T20:32:56.9656729Z 2025-05-07T20:32:56.9656920Z if scale_ub is not None: 2025-05-07T20:32:56.9657201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.9657616Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.9657920Z ) 2025-05-07T20:32:56.9658190Z else: 2025-05-07T20:32:56.9658401Z scale_ub_tensor = None 2025-05-07T20:32:56.9658650Z 2025-05-07T20:32:56.9658887Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.9659200Z op = silu_mul_quant 2025-05-07T20:32:56.9659449Z if compiled: 2025-05-07T20:32:56.9659704Z op = torch.compile(op) 2025-05-07T20:32:56.9660003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.9660353Z 2025-05-07T20:32:56.9660542Z > y_fp8, y_scale = fn() 2025-05-07T20:32:56.9660715Z 2025-05-07T20:32:56.9660817Z moe/activation_test.py:117: 2025-05-07T20:32:56.9661113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.9661439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:56.9661725Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.9662296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:56.9662857Z return fn(*args, **kwargs) 
2025-05-07T20:32:56.9663529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:56.9664232Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:56.9664776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.9665463Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.9666142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.9666679Z kernel = self.compile( 2025-05-07T20:32:56.9667227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.9667894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.9668291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.9668517Z 2025-05-07T20:32:56.9668732Z self = 2025-05-07T20:32:56.9669898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.9671314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4280>} 2025-05-07T20:32:56.9672681Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.9673737Z context = 2025-05-07T20:32:56.9674069Z 2025-05-07T20:32:56.9674241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.9674764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.9675242Z module_map=module_map) 2025-05-07T20:32:56.9675610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.9675965Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:56.9676221Z E ^ 2025-05-07T20:32:56.9676737Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.9677196Z 2025-05-07T20:32:56.9677622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.9678180Z 2025-05-07T20:32:57.2924903Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.2925824Z self=, 2025-05-07T20:32:57.2926643Z T=4096, 2025-05-07T20:32:57.2927018Z D=5120, 2025-05-07T20:32:57.2927410Z scale_ub=1200.0, 2025-05-07T20:32:57.2927857Z contiguous=False, 2025-05-07T20:32:57.2928306Z compiled=False, 2025-05-07T20:32:57.2928716Z ) 2025-05-07T20:32:57.2929370Z self = 2025-05-07T20:32:57.2930376Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.2931115Z 2025-05-07T20:32:57.2931280Z @given( 2025-05-07T20:32:57.2931740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.2932373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.2932995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.2933570Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.2933908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.2934211Z ) 2025-05-07T20:32:57.2934566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.2935016Z def test_silu_mul_quant( 2025-05-07T20:32:57.2935266Z self, 2025-05-07T20:32:57.2935463Z T: int, 2025-05-07T20:32:57.2935661Z D: int, 2025-05-07T20:32:57.2935889Z scale_ub: Optional[float], 2025-05-07T20:32:57.2936162Z contiguous: bool, 2025-05-07T20:32:57.2936407Z compiled: bool, 2025-05-07T20:32:57.2936642Z ) -> None: 2025-05-07T20:32:57.2936861Z torch.manual_seed(2025) 2025-05-07T20:32:57.2937109Z 2025-05-07T20:32:57.2937391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.2937735Z 2025-05-07T20:32:57.2937939Z x_sign = torch.sign(x) 2025-05-07T20:32:57.2938334Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.2938652Z x = x_sign * x_clamp 2025-05-07T20:32:57.2938892Z x0 = x[:, :D] 2025-05-07T20:32:57.2939113Z x1 = x[:, D:] 2025-05-07T20:32:57.2939326Z 2025-05-07T20:32:57.2939513Z if contiguous: 2025-05-07T20:32:57.2939747Z x0 = x0.contiguous() 2025-05-07T20:32:57.2940010Z x1 = x1.contiguous() 2025-05-07T20:32:57.2940250Z 2025-05-07T20:32:57.2940448Z if scale_ub is not None: 2025-05-07T20:32:57.2940827Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.2941165Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.2941484Z ) 2025-05-07T20:32:57.2941682Z else: 2025-05-07T20:32:57.2941895Z scale_ub_tensor = None 2025-05-07T20:32:57.2942156Z 2025-05-07T20:32:57.2942400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.2942715Z op = silu_mul_quant 2025-05-07T20:32:57.2942966Z if compiled: 2025-05-07T20:32:57.2943221Z op = torch.compile(op) 2025-05-07T20:32:57.2943527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2943804Z 2025-05-07T20:32:57.2944029Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.2944221Z 2025-05-07T20:32:57.2944329Z moe/activation_test.py:117: 2025-05-07T20:32:57.2944626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2944959Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.2945246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2945946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.2946715Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.2947262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.2947954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.2948684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.2949224Z kernel = self.compile( 2025-05-07T20:32:57.2949779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.2950442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.2950842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2951074Z 2025-05-07T20:32:57.2951290Z self = 2025-05-07T20:32:57.2952435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.2953836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5000>} 2025-05-07T20:32:57.2955196Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.2956484Z context = 2025-05-07T20:32:57.2956782Z 2025-05-07T20:32:57.2956951Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.2957486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.2957962Z module_map=module_map) 2025-05-07T20:32:57.2958336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.2958698Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.2958967Z E ^ 2025-05-07T20:32:57.2959437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.2959897Z 2025-05-07T20:32:57.2960319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.2960837Z 2025-05-07T20:32:57.2960950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.2961465Z self=, 2025-05-07T20:32:57.2961872Z T=4096, 2025-05-07T20:32:57.2962067Z D=5120, 2025-05-07T20:32:57.2962263Z scale_ub=1200.0, 2025-05-07T20:32:57.2962504Z contiguous=False, 2025-05-07T20:32:57.2962734Z compiled=True, 2025-05-07T20:32:57.2962948Z ) 2025-05-07T20:32:57.2963307Z self = 2025-05-07T20:32:57.2963807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:57.2964089Z 2025-05-07T20:32:57.2964170Z @given( 2025-05-07T20:32:57.2964407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.2964726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.2965042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.2965378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.2965709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.2966007Z ) 2025-05-07T20:32:57.2966365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.2966807Z def test_silu_mul_quant( 2025-05-07T20:32:57.2967062Z self, 2025-05-07T20:32:57.2967262Z T: int, 2025-05-07T20:32:57.2967526Z D: int, 2025-05-07T20:32:57.2967761Z scale_ub: Optional[float], 2025-05-07T20:32:57.2968037Z contiguous: bool, 2025-05-07T20:32:57.2968284Z compiled: bool, 2025-05-07T20:32:57.2968511Z ) -> None: 2025-05-07T20:32:57.2968794Z torch.manual_seed(2025) 2025-05-07T20:32:57.2969042Z 2025-05-07T20:32:57.2969321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.2969674Z 2025-05-07T20:32:57.2969881Z x_sign = torch.sign(x) 2025-05-07T20:32:57.2970175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.2970492Z x = x_sign * x_clamp 2025-05-07T20:32:57.2970737Z x0 = x[:, :D] 2025-05-07T20:32:57.2970957Z x1 = x[:, D:] 2025-05-07T20:32:57.2971176Z 2025-05-07T20:32:57.2971370Z if contiguous: 2025-05-07T20:32:57.2971604Z x0 = x0.contiguous() 2025-05-07T20:32:57.2971937Z x1 = x1.contiguous() 2025-05-07T20:32:57.2972184Z 2025-05-07T20:32:57.2972387Z if scale_ub is not None: 2025-05-07T20:32:57.2972666Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.2973014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.2973333Z ) 2025-05-07T20:32:57.2973527Z else: 2025-05-07T20:32:57.2973746Z scale_ub_tensor = None 2025-05-07T20:32:57.2974001Z 2025-05-07T20:32:57.2974235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.2974552Z op = silu_mul_quant 2025-05-07T20:32:57.2974817Z if compiled: 2025-05-07T20:32:57.2975065Z op = torch.compile(op) 2025-05-07T20:32:57.2975373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2975651Z 2025-05-07T20:32:57.2975852Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.2976025Z 2025-05-07T20:32:57.2976132Z moe/activation_test.py:117: 2025-05-07T20:32:57.2976430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2976756Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.2977043Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.2977609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.2978220Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.2978883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.2979582Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.2980126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.2980916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.2981591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.2982135Z kernel = self.compile( 2025-05-07T20:32:57.2982689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.2983349Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.2983751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.2983978Z 2025-05-07T20:32:57.2984193Z self = 2025-05-07T20:32:57.2985293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.2986681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4700>} 2025-05-07T20:32:57.2988095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.2989144Z context = 2025-05-07T20:32:57.2989476Z 2025-05-07T20:32:57.2989652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.2990180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.2990663Z module_map=module_map) 2025-05-07T20:32:57.2991036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.2991401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.2991662Z E ^ 2025-05-07T20:32:57.2992137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.2992638Z 2025-05-07T20:32:57.2993069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.2993587Z 2025-05-07T20:32:57.4260696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4261201Z self=, 2025-05-07T20:32:57.4261650Z T=2048, 2025-05-07T20:32:57.4261869Z D=7168, 2025-05-07T20:32:57.4262076Z scale_ub=1200.0, 2025-05-07T20:32:57.4262307Z contiguous=False, 2025-05-07T20:32:57.4262540Z compiled=False, 2025-05-07T20:32:57.4262752Z ) 2025-05-07T20:32:57.4263075Z self = 2025-05-07T20:32:57.4263584Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.4263862Z 2025-05-07T20:32:57.4263949Z @given( 2025-05-07T20:32:57.4264191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.4264508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.4264825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.4265161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.4265490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.4265781Z ) 2025-05-07T20:32:57.4266136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.4266578Z def test_silu_mul_quant( 2025-05-07T20:32:57.4266823Z self, 2025-05-07T20:32:57.4267022Z T: int, 2025-05-07T20:32:57.4267218Z D: int, 2025-05-07T20:32:57.4267442Z scale_ub: Optional[float], 2025-05-07T20:32:57.4267724Z contiguous: bool, 2025-05-07T20:32:57.4268077Z compiled: bool, 2025-05-07T20:32:57.4268313Z ) -> None: 2025-05-07T20:32:57.4268536Z torch.manual_seed(2025) 2025-05-07T20:32:57.4268779Z 2025-05-07T20:32:57.4269055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.4269403Z 2025-05-07T20:32:57.4269605Z x_sign = torch.sign(x) 2025-05-07T20:32:57.4269899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.4270212Z x = x_sign * x_clamp 2025-05-07T20:32:57.4270454Z x0 = x[:, :D] 2025-05-07T20:32:57.4270671Z x1 = x[:, D:] 2025-05-07T20:32:57.4270880Z 2025-05-07T20:32:57.4271071Z if contiguous: 2025-05-07T20:32:57.4271304Z x0 = x0.contiguous() 2025-05-07T20:32:57.4271568Z x1 = x1.contiguous() 2025-05-07T20:32:57.4271813Z 2025-05-07T20:32:57.4272004Z if scale_ub is not None: 2025-05-07T20:32:57.4272283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.4272631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.4272939Z ) 2025-05-07T20:32:57.4273140Z else: 2025-05-07T20:32:57.4273359Z scale_ub_tensor = None 2025-05-07T20:32:57.4280081Z 2025-05-07T20:32:57.4280458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.4280793Z op = silu_mul_quant 2025-05-07T20:32:57.4281045Z if compiled: 2025-05-07T20:32:57.4281289Z op = torch.compile(op) 2025-05-07T20:32:57.4281587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4281925Z 2025-05-07T20:32:57.4282113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.4282284Z 2025-05-07T20:32:57.4282383Z moe/activation_test.py:117: 2025-05-07T20:32:57.4282674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4283007Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.4283284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4283982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.4284749Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.4285293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.4285983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.4286648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.4287191Z kernel = self.compile( 2025-05-07T20:32:57.4287732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.4288398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.4288792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4289041Z 2025-05-07T20:32:57.4289259Z self = 2025-05-07T20:32:57.4290359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.4291751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5240>} 2025-05-07T20:32:57.4293116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.4294207Z context = 2025-05-07T20:32:57.4294496Z 2025-05-07T20:32:57.4294715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.4295240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.4295720Z module_map=module_map) 2025-05-07T20:32:57.4296106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.4296464Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.4296719Z E ^ 2025-05-07T20:32:57.4297190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.4297649Z 2025-05-07T20:32:57.4298154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.4298672Z 2025-05-07T20:32:57.4298789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4299206Z self=, 2025-05-07T20:32:57.4299609Z T=1, 2025-05-07T20:32:57.4299802Z D=7168, 2025-05-07T20:32:57.4299993Z scale_ub=None, 2025-05-07T20:32:57.4300213Z contiguous=True, 2025-05-07T20:32:57.4300442Z compiled=False, 2025-05-07T20:32:57.4300646Z ) 2025-05-07T20:32:57.4300971Z self = 2025-05-07T20:32:57.4301513Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:57.4301776Z 2025-05-07T20:32:57.4301853Z @given( 2025-05-07T20:32:57.4302087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.4302445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.4302755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.4303085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.4303418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.4303706Z ) 2025-05-07T20:32:57.4304057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.4304556Z def test_silu_mul_quant( 2025-05-07T20:32:57.4304796Z self, 2025-05-07T20:32:57.4304988Z T: int, 2025-05-07T20:32:57.4305232Z D: int, 2025-05-07T20:32:57.4305456Z scale_ub: Optional[float], 2025-05-07T20:32:57.4305726Z contiguous: bool, 2025-05-07T20:32:57.4305969Z compiled: bool, 2025-05-07T20:32:57.4306191Z ) -> None: 2025-05-07T20:32:57.4306411Z torch.manual_seed(2025) 2025-05-07T20:32:57.4306652Z 2025-05-07T20:32:57.4306923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.4307268Z 2025-05-07T20:32:57.4307461Z x_sign = torch.sign(x) 2025-05-07T20:32:57.4307755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.4308059Z x = x_sign * x_clamp 2025-05-07T20:32:57.4308301Z x0 = x[:, :D] 2025-05-07T20:32:57.4308515Z x1 = x[:, D:] 2025-05-07T20:32:57.4308717Z 2025-05-07T20:32:57.4308903Z if contiguous: 2025-05-07T20:32:57.4309137Z x0 = x0.contiguous() 2025-05-07T20:32:57.4309391Z x1 = x1.contiguous() 2025-05-07T20:32:57.4309634Z 2025-05-07T20:32:57.4309831Z if scale_ub is not None: 2025-05-07T20:32:57.4310102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.4310444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.4310751Z ) 2025-05-07T20:32:57.4310943Z else: 2025-05-07T20:32:57.4311154Z scale_ub_tensor = None 2025-05-07T20:32:57.4311413Z 2025-05-07T20:32:57.4311644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.4311962Z op = silu_mul_quant 2025-05-07T20:32:57.4312211Z if compiled: 2025-05-07T20:32:57.4312461Z op = torch.compile(op) 2025-05-07T20:32:57.4312756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4313031Z 2025-05-07T20:32:57.4313224Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.4313438Z 2025-05-07T20:32:57.4313541Z moe/activation_test.py:117: 2025-05-07T20:32:57.4313836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4314164Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.4314441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.4315140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.4315839Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.4316381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.4317063Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.4317737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.4318275Z kernel = self.compile( 2025-05-07T20:32:57.4318819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.4319480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.4319924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.4320151Z 2025-05-07T20:32:57.4320363Z self = 2025-05-07T20:32:57.4321447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.4322875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a6050>} 2025-05-07T20:32:57.4324296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.4325376Z context = 2025-05-07T20:32:57.4325665Z 2025-05-07T20:32:57.4325837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.4326358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.4326834Z module_map=module_map) 2025-05-07T20:32:57.4327202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.4327556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.4327815Z E ^ 2025-05-07T20:32:57.4328280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.4328732Z 2025-05-07T20:32:57.4329160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.4329676Z 2025-05-07T20:32:57.4329780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.4330199Z self=, 2025-05-07T20:32:57.4330604Z T=16384, 2025-05-07T20:32:57.4330793Z D=7168, 2025-05-07T20:32:57.4330989Z scale_ub=1200.0, 2025-05-07T20:32:57.4331215Z contiguous=False, 2025-05-07T20:32:57.4331436Z compiled=True, 2025-05-07T20:32:57.6949127Z ) 2025-05-07T20:32:57.6950309Z self = 2025-05-07T20:32:57.6951707Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:57.6952299Z 2025-05-07T20:32:57.6952463Z @given( 2025-05-07T20:32:57.6952944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6953496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6953959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6954307Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6954646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6954947Z ) 2025-05-07T20:32:57.6955321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6956014Z def test_silu_mul_quant( 2025-05-07T20:32:57.6956268Z self, 2025-05-07T20:32:57.6956475Z T: int, 2025-05-07T20:32:57.6956675Z D: int, 2025-05-07T20:32:57.6956909Z scale_ub: Optional[float], 2025-05-07T20:32:57.6957191Z contiguous: bool, 2025-05-07T20:32:57.6957439Z compiled: bool, 2025-05-07T20:32:57.6957668Z ) -> None: 2025-05-07T20:32:57.6957893Z torch.manual_seed(2025) 2025-05-07T20:32:57.6958142Z 2025-05-07T20:32:57.6958423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6958777Z 2025-05-07T20:32:57.6958985Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6959281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6959601Z x = x_sign * x_clamp 2025-05-07T20:32:57.6959854Z x0 = x[:, :D] 2025-05-07T20:32:57.6960074Z x1 = x[:, D:] 2025-05-07T20:32:57.6960290Z 2025-05-07T20:32:57.6960561Z if contiguous: 2025-05-07T20:32:57.6960804Z x0 = x0.contiguous() 2025-05-07T20:32:57.6961072Z x1 = x1.contiguous() 2025-05-07T20:32:57.6961321Z 2025-05-07T20:32:57.6961580Z if scale_ub is not None: 2025-05-07T20:32:57.6961864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6962211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6962521Z ) 2025-05-07T20:32:57.6962724Z else: 2025-05-07T20:32:57.6962943Z scale_ub_tensor = None 2025-05-07T20:32:57.6963204Z 2025-05-07T20:32:57.6963443Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6963768Z op = silu_mul_quant 2025-05-07T20:32:57.6964025Z if compiled: 2025-05-07T20:32:57.6964277Z op = torch.compile(op) 2025-05-07T20:32:57.6964654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6964933Z 2025-05-07T20:32:57.6965134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6965310Z 2025-05-07T20:32:57.6965416Z moe/activation_test.py:117: 2025-05-07T20:32:57.6965713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6966052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6966349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6966926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.6967504Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.6968173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.6968883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6969434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6970294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6970961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6971501Z kernel = self.compile( 2025-05-07T20:32:57.6972054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6972713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6973109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6973334Z 2025-05-07T20:32:57.6973549Z self = 2025-05-07T20:32:57.6974721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6976135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a7490>} 2025-05-07T20:32:57.6977508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6978616Z context = 2025-05-07T20:32:57.6978911Z 2025-05-07T20:32:57.6979081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6979612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6980082Z module_map=module_map) 2025-05-07T20:32:57.6980448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6980807Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6981066Z E ^ 2025-05-07T20:32:57.6981602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6982066Z 2025-05-07T20:32:57.6982487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6983050Z 2025-05-07T20:32:57.6983160Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6983573Z self=, 2025-05-07T20:32:57.6983976Z T=1, 2025-05-07T20:32:57.6984164Z D=7168, 2025-05-07T20:32:57.6984362Z scale_ub=None, 2025-05-07T20:32:57.6984575Z contiguous=False, 2025-05-07T20:32:57.6984810Z compiled=False, 2025-05-07T20:32:57.6985018Z ) 2025-05-07T20:32:57.6985336Z self = 2025-05-07T20:32:57.6985868Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:57.6986132Z 2025-05-07T20:32:57.6986216Z @given( 2025-05-07T20:32:57.6986442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6986761Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6987069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6987398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6987729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6988016Z ) 2025-05-07T20:32:57.6988373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6988811Z def test_silu_mul_quant( 2025-05-07T20:32:57.6989057Z self, 2025-05-07T20:32:57.6989252Z T: int, 2025-05-07T20:32:57.6989446Z D: int, 2025-05-07T20:32:57.6989664Z scale_ub: Optional[float], 2025-05-07T20:32:57.6989936Z contiguous: bool, 2025-05-07T20:32:57.6990175Z compiled: bool, 2025-05-07T20:32:57.6990399Z ) -> None: 2025-05-07T20:32:57.6990621Z torch.manual_seed(2025) 2025-05-07T20:32:57.6990860Z 2025-05-07T20:32:57.6991137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6991475Z 2025-05-07T20:32:57.6991663Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6991960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6992266Z x = x_sign * x_clamp 2025-05-07T20:32:57.6992501Z x0 = x[:, :D] 2025-05-07T20:32:57.6992716Z x1 = x[:, D:] 2025-05-07T20:32:57.6992922Z 2025-05-07T20:32:57.6993111Z if contiguous: 2025-05-07T20:32:57.6993339Z x0 = x0.contiguous() 2025-05-07T20:32:57.6993598Z x1 = x1.contiguous() 2025-05-07T20:32:57.6993884Z 2025-05-07T20:32:57.6994075Z if scale_ub is not None: 2025-05-07T20:32:57.6994353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6994695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6994999Z ) 2025-05-07T20:32:57.6995198Z else: 2025-05-07T20:32:57.6995411Z scale_ub_tensor = None 2025-05-07T20:32:57.6995660Z 2025-05-07T20:32:57.6995896Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6996211Z op = silu_mul_quant 2025-05-07T20:32:57.6996459Z if compiled: 2025-05-07T20:32:57.6996711Z op = torch.compile(op) 2025-05-07T20:32:57.6997009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6997276Z 2025-05-07T20:32:57.6997473Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6997643Z 2025-05-07T20:32:57.6997744Z moe/activation_test.py:117: 2025-05-07T20:32:57.6998047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6998368Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6998650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6999395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.7000091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.7000632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.7001366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.7002032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.7002563Z kernel = self.compile( 2025-05-07T20:32:57.7003110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.7003776Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.7004167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.7004441Z 2025-05-07T20:32:57.7004655Z self = 2025-05-07T20:32:57.7005748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.7007143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a77f0>} 2025-05-07T20:32:57.7008507Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.7009544Z context = 2025-05-07T20:32:57.7009839Z 2025-05-07T20:32:57.7010010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.7010544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.7011017Z module_map=module_map) 2025-05-07T20:32:57.7011378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.7011737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.7011998Z E ^ 2025-05-07T20:32:57.7012463Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.7012922Z 2025-05-07T20:32:57.7013341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.7013864Z 2025-05-07T20:32:57.7014018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.7014444Z self=, 2025-05-07T20:32:57.7014848Z T=2048, 2025-05-07T20:32:57.7015041Z D=7168, 2025-05-07T20:32:57.7015235Z scale_ub=None, 2025-05-07T20:32:57.7015452Z contiguous=False, 2025-05-07T20:32:57.7015681Z compiled=True, 2025-05-07T20:32:57.7015885Z ) 2025-05-07T20:32:57.8006291Z self = 2025-05-07T20:32:57.8007831Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.8008614Z 2025-05-07T20:32:57.8008826Z @given( 2025-05-07T20:32:57.8009453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.8010087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.8010717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.8011392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8012065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8012639Z ) 2025-05-07T20:32:57.8013347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8014197Z def test_silu_mul_quant( 2025-05-07T20:32:57.8014480Z self, 2025-05-07T20:32:57.8014800Z T: int, 2025-05-07T20:32:57.8015008Z D: int, 2025-05-07T20:32:57.8015231Z scale_ub: Optional[float], 2025-05-07T20:32:57.8015515Z contiguous: bool, 2025-05-07T20:32:57.8015850Z compiled: bool, 2025-05-07T20:32:57.8016077Z ) -> None: 2025-05-07T20:32:57.8016302Z torch.manual_seed(2025) 2025-05-07T20:32:57.8016551Z 2025-05-07T20:32:57.8016834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8017185Z 2025-05-07T20:32:57.8017390Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8017685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8018007Z x = x_sign * x_clamp 2025-05-07T20:32:57.8018324Z x0 = x[:, :D] 2025-05-07T20:32:57.8018547Z x1 = x[:, D:] 2025-05-07T20:32:57.8018766Z 2025-05-07T20:32:57.8019034Z if contiguous: 2025-05-07T20:32:57.8019269Z x0 = x0.contiguous() 2025-05-07T20:32:57.8019540Z x1 = x1.contiguous() 2025-05-07T20:32:57.8019790Z 2025-05-07T20:32:57.8019986Z if scale_ub is not None: 2025-05-07T20:32:57.8020270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8020620Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8020934Z ) 2025-05-07T20:32:57.8021131Z else: 2025-05-07T20:32:57.8021350Z scale_ub_tensor = None 2025-05-07T20:32:57.8021607Z 2025-05-07T20:32:57.8021843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8022169Z op = silu_mul_quant 2025-05-07T20:32:57.8022425Z if compiled: 2025-05-07T20:32:57.8022679Z op = torch.compile(op) 2025-05-07T20:32:57.8022985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8023267Z 2025-05-07T20:32:57.8023464Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.8023640Z 2025-05-07T20:32:57.8023743Z moe/activation_test.py:117: 2025-05-07T20:32:57.8024044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8024415Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.8024716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8025291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.8025859Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.8026525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.8027226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.8027896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8028591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8029266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8029806Z kernel = self.compile( 2025-05-07T20:32:57.8030359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8031026Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8031433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8031667Z 2025-05-07T20:32:57.8031877Z self = 2025-05-07T20:32:57.8032982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8034427Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540af0>} 2025-05-07T20:32:57.8035788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8036874Z context = 2025-05-07T20:32:57.8037165Z 2025-05-07T20:32:57.8037343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8037880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8038358Z module_map=module_map) 2025-05-07T20:32:57.8038734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8039095Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.8039353Z E ^ 2025-05-07T20:32:57.8039869Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8040326Z 2025-05-07T20:32:57.8040753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8041273Z 2025-05-07T20:32:57.8041389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.8041809Z self=, 2025-05-07T20:32:57.8042213Z T=4096, 2025-05-07T20:32:57.8042406Z D=7168, 2025-05-07T20:32:57.8042606Z scale_ub=None, 2025-05-07T20:32:57.8048718Z contiguous=False, 2025-05-07T20:32:57.8048964Z compiled=True, 2025-05-07T20:32:57.8049164Z ) 2025-05-07T20:32:57.8049495Z self = 2025-05-07T20:32:57.8049997Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:57.8050270Z 2025-05-07T20:32:57.8050348Z @given( 2025-05-07T20:32:57.8050577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.8050893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.8051196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.8051524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.8051856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.8052133Z ) 2025-05-07T20:32:57.8052482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.8052931Z def test_silu_mul_quant( 2025-05-07T20:32:57.8053175Z self, 2025-05-07T20:32:57.8053360Z T: int, 2025-05-07T20:32:57.8053553Z D: int, 2025-05-07T20:32:57.8053772Z scale_ub: Optional[float], 2025-05-07T20:32:57.8054113Z contiguous: bool, 2025-05-07T20:32:57.8054380Z compiled: bool, 2025-05-07T20:32:57.8054630Z ) -> None: 2025-05-07T20:32:57.8054838Z torch.manual_seed(2025) 2025-05-07T20:32:57.8055082Z 2025-05-07T20:32:57.8055362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.8055952Z 2025-05-07T20:32:57.8056147Z x_sign = torch.sign(x) 2025-05-07T20:32:57.8056439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.8056753Z x = x_sign * x_clamp 2025-05-07T20:32:57.8056987Z x0 = x[:, :D] 2025-05-07T20:32:57.8057204Z x1 = x[:, D:] 2025-05-07T20:32:57.8057410Z 2025-05-07T20:32:57.8057589Z if contiguous: 2025-05-07T20:32:57.8057821Z x0 = x0.contiguous() 2025-05-07T20:32:57.8058181Z x1 = x1.contiguous() 2025-05-07T20:32:57.8058419Z 2025-05-07T20:32:57.8058612Z if scale_ub is not None: 2025-05-07T20:32:57.8058889Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.8059227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.8059532Z ) 2025-05-07T20:32:57.8059728Z else: 2025-05-07T20:32:57.8059932Z scale_ub_tensor = None 2025-05-07T20:32:57.8060181Z 2025-05-07T20:32:57.8060503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.8060816Z op = silu_mul_quant 2025-05-07T20:32:57.8061069Z if compiled: 2025-05-07T20:32:57.8061317Z op = torch.compile(op) 2025-05-07T20:32:57.8061672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8061942Z 2025-05-07T20:32:57.8062135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.8062300Z 2025-05-07T20:32:57.8062405Z moe/activation_test.py:117: 2025-05-07T20:32:57.8062691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8063019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.8063307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.8063868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:57.8064501Z return fn(*args, **kwargs) 
2025-05-07T20:32:57.8065167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.8065861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.8066397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.8067082Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.8067744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.8068278Z kernel = self.compile( 2025-05-07T20:32:57.8068829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.8069491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.8069886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.8070111Z 2025-05-07T20:32:57.8070322Z self = 2025-05-07T20:32:57.8071412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.8072805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540280>} 2025-05-07T20:32:57.8074279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.8075320Z context = 2025-05-07T20:32:57.8075618Z 2025-05-07T20:32:57.8075785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.8076317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.8076786Z module_map=module_map) 2025-05-07T20:32:57.8077153Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.8077523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.8077783Z E ^ 2025-05-07T20:32:57.8078249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.8078708Z 2025-05-07T20:32:57.8079130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.8079642Z 2025-05-07T20:32:58.1471471Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.1472117Z self=, 2025-05-07T20:32:58.1472763Z T=16384, 2025-05-07T20:32:58.1473036Z D=5120, 2025-05-07T20:32:58.1473452Z scale_ub=1200.0, 2025-05-07T20:32:58.1473745Z contiguous=False, 2025-05-07T20:32:58.1473975Z compiled=False, 2025-05-07T20:32:58.1474178Z ) 2025-05-07T20:32:58.1474542Z self = 2025-05-07T20:32:58.1475136Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:58.1475423Z 2025-05-07T20:32:58.1475505Z @given( 2025-05-07T20:32:58.1475738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.1476057Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.1476364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.1476703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.1477035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.1477321Z ) 2025-05-07T20:32:58.1477740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.1478190Z def test_silu_mul_quant( 2025-05-07T20:32:58.1478433Z self, 2025-05-07T20:32:58.1478623Z T: int, 2025-05-07T20:32:58.1478821Z D: int, 2025-05-07T20:32:58.1479039Z scale_ub: Optional[float], 2025-05-07T20:32:58.1479314Z contiguous: bool, 2025-05-07T20:32:58.1479555Z compiled: bool, 2025-05-07T20:32:58.1479788Z ) -> None: 2025-05-07T20:32:58.1480003Z torch.manual_seed(2025) 2025-05-07T20:32:58.1480249Z 2025-05-07T20:32:58.1480525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.1480867Z 2025-05-07T20:32:58.1481063Z x_sign = torch.sign(x) 2025-05-07T20:32:58.1481364Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.1481681Z x = x_sign * x_clamp 2025-05-07T20:32:58.1481914Z x0 = x[:, :D] 2025-05-07T20:32:58.1482135Z x1 = x[:, D:] 2025-05-07T20:32:58.1482343Z 2025-05-07T20:32:58.1482529Z if contiguous: 2025-05-07T20:32:58.1482764Z x0 = x0.contiguous() 2025-05-07T20:32:58.1483028Z x1 = x1.contiguous() 2025-05-07T20:32:58.1483267Z 2025-05-07T20:32:58.1483461Z if scale_ub is not None: 2025-05-07T20:32:58.1483741Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.1484076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.1484390Z ) 2025-05-07T20:32:58.1484588Z else: 2025-05-07T20:32:58.1484791Z scale_ub_tensor = None 2025-05-07T20:32:58.1485046Z 2025-05-07T20:32:58.1485284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.1485593Z op = silu_mul_quant 2025-05-07T20:32:58.1485915Z if compiled: 2025-05-07T20:32:58.1486170Z op = torch.compile(op) 2025-05-07T20:32:58.1486464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1486742Z 2025-05-07T20:32:58.1486940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.1487105Z 2025-05-07T20:32:58.1487214Z moe/activation_test.py:117: 2025-05-07T20:32:58.1487508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.1487841Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.1488132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.1488832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:58.1489527Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:58.1490071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:58.1490761Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:58.1491424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:58.1491966Z     kernel = self.compile(
2025-05-07T20:32:58.1492562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:58.1493224Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:58.1493621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.1493894Z 
2025-05-07T20:32:58.1494124Z self = 
2025-05-07T20:32:58.1495253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:58.1496658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542d40>}
2025-05-07T20:32:58.1498132Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:58.1499174Z context = 
2025-05-07T20:32:58.1499474Z 
2025-05-07T20:32:58.1499641Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:58.1500167Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:58.1500641Z                            module_map=module_map)
2025-05-07T20:32:58.1501010Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.1501370Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.1501630Z E   ^
2025-05-07T20:32:58.1502100Z E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.1502560Z 
2025-05-07T20:32:58.1502983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
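Every failure in this run is the same architecture gap: Triton's fp8e4nv (e4m3) type is only available on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper), while older parts expose only fp8e4b15 and fp8e5, which is exactly the ValueError repeated below. A minimal guard sketch along these lines (has_fp8e4nv and the test class are hypothetical helpers for illustration, not part of moe/activation_test.py) would skip the fp8 cases on unsupported hardware instead of erroring:

```python
# Hypothetical guard, not part of the FBGEMM test suite: skip fp8e4nv
# tests on GPUs that cannot compile them.
import unittest

import torch


def has_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs compute capability (8, 9) or newer; earlier
    # GPUs only get fp8e4b15/fp8e5 in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(has_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class SiluMulQuantFp8Tests(unittest.TestCase):
    def test_capability_assumption(self) -> None:
        # Only runs on hardware where the fp8e4nv kernels can compile.
        self.assertTrue(has_fp8e4nv())
```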
Hypothesis went on to try eleven more examples; every one failed with the identical CompilationError from triton/compiler/compiler.py:100 (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
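Reading the test body above, silu_mul_quant takes the two halves x0 and x1 of the input, applies SiLU to x0, multiplies by x1, and quantizes the product to fp8 with a per-row scale that scale_ub may cap. A rough eager-mode sketch of those assumed semantics (silu_mul_quant_ref is a hypothetical reference helper inferred from the test, not FBGEMM's _fbgemm_silu_mul_quant kernel, which fuses all of this into one Triton pass):

```python
# Rough eager-mode sketch of the semantics under test, assumed from the
# test body above; not FBGEMM's actual implementation.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in fp32 for a stable reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale per row, derived from the row's absolute maximum.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

Note that the eager cast to torch.float8_e4m3fn is emulated by PyTorch and works on any device; the failures in this log come from the Triton kernel's compile-time cast to fp8e4nv, which this GPU architecture does not provide.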
2025-05-07T20:32:59.3809809Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> test body identical to the one above; fails at moe/activation_test.py:117 with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100).

2025-05-07T20:32:59.4698555Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails earlier, while materializing the inputs at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)):
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
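Each "Tried to allocate" size matches the bf16 test input exactly: x has shape [T, 2 * D] in bfloat16 (2 bytes per element), and torch.sign / torch.clamp each materialize another tensor of the same size, which is why some examples die on those lines rather than on torch.randn. A quick arithmetic check (plain Python; shapes taken from the examples in this log):

    def alloc_mib(T: int, D: int) -> float:
        # One [T, 2*D] bfloat16 tensor: 2 bytes per element.
        return T * 2 * D * 2 / 2**20

    assert alloc_mib(16384, 5120) == 320.0  # matches the 320.00 MiB failure above
    assert alloc_mib(16384, 7168) == 448.0  # 448.00 MiB failures
    assert alloc_mib(4096, 7168) == 112.0   # 112.00 MiB failures
    assert alloc_mib(4096, 5120) == 80.0    # 80.00 MiB failure
    assert alloc_mib(2048, 7168) == 56.0    # 56.00 MiB failures
    assert alloc_mib(2048, 5120) == 40.0    # 40.00 MiB failures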
2025-05-07T20:32:59.4712968Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free; 21.61 GiB allocated by PyTorch.
2025-05-07T20:32:59.4726418Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 140.44 MiB free; 21.50 GiB allocated by PyTorch.
2025-05-07T20:32:59.4739104Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free; 21.67 GiB allocated by PyTorch.
2025-05-07T20:32:59.4752423Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free; 21.67 GiB allocated by PyTorch.
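The out-of-memory failures are cumulative rather than per-example: across successive examples the "allocated by PyTorch" figure stays between 21.50 and 21.73 GiB while only tens of MiB remain free, so even a 40 MiB request fails. A sketch of the usual mitigation between examples (standard PyTorch APIs; whether this harness can hook them between Hypothesis examples is an assumption):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references left over from the previous example,
        # then return the caching allocator's blocks so the next example
        # starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()

    # The error text itself suggests the fragmentation knob; it must be set
    # before the process first initializes CUDA to take effect:
    #     PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True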
2025-05-07T20:32:59.7746157Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> input setup succeeds; fails at moe/activation_test.py:117 with the same fp8e4nv CompilationError (triton/compiler/compiler.py:100).
2025-05-07T20:32:59.7777717Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
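For reference, the operation under test combines a SiLU gate with rowwise fp8 quantization. The eager sketch below is a hypothetical reconstruction from the call sites visible in this log (silu_mul_quant(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale)), not FBGEMM's implementation; the e4m3 maximum of 448 and the use of scale_ub as a cap on the rowwise maximum are assumptions:

    import torch
    import torch.nn.functional as F

    FP8_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Hypothetical reference: SiLU(x0) * x1, quantized rowwise to fp8.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale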
2025-05-07T20:32:59.8589646Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
2025-05-07T20:32:59.8621095Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 26.44 MiB free; 21.69 GiB allocated by PyTorch.
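Every torch.randn failure above is a modest request (40 to 448 MiB) against a nearly full 22.07 GiB device, so one defensive option is to check free memory before building an example's inputs and skip when the device is exhausted. A sketch using torch.cuda.mem_get_info (the skip-instead-of-fail policy is an assumption, not part of this test):

    import pytest
    import torch

    def require_free_cuda_mib(required_mib: float) -> None:
        # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
        # current device; skip the example rather than letting torch.randn
        # raise OutOfMemoryError as in the failures above.
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        if free_bytes < required_mib * 2**20:
            pytest.skip(f"only {free_bytes / 2**20:.0f} MiB free, "
                        f"need ~{required_mib:.0f} MiB")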
2025-05-07T20:32:59.9576819Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same fp8e4nv CompilationError at moe/activation_test.py:117.
2025-05-07T20:32:59.9608548Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
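The two memory figures quoted in each OutOfMemoryError map directly onto the allocator's counters: "allocated by PyTorch" is memory held by live tensors, while "reserved by PyTorch but unallocated" is cached pool space that is free but not returned to the driver. A small helper (a sketch) that reports the same counters when debugging an accumulation like the one visible in this run:

    import torch

    def cuda_memory_report() -> str:
        # memory_allocated(): bytes held by live tensors ("allocated by PyTorch").
        # memory_reserved(): bytes held by the caching allocator, including the
        # "reserved but unallocated" slack called out in the error message.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        return f"allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB"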
2025-05-07T20:32:59.9621632Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 320.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
2025-05-07T20:33:00.0606776Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0619677Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0632264Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0644830Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.0663514Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.1945419Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> OutOfMemoryError at x = torch.randn(...): tried to allocate 448.00 MiB with 26.44 MiB free; 21.73 GiB allocated by PyTorch.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1958479Z 2025-05-07T20:33:00.1958603Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1958824Z 2025-05-07T20:33:00.1958948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1959372Z self=, 2025-05-07T20:33:00.1959794Z T=4096, 2025-05-07T20:33:00.1959994Z D=7168, 2025-05-07T20:33:00.1960195Z scale_ub=None, 2025-05-07T20:33:00.1960418Z contiguous=True, 2025-05-07T20:33:00.1960657Z compiled=False, 2025-05-07T20:33:00.1960876Z ) 2025-05-07T20:33:00.1961203Z self = 2025-05-07T20:33:00.1961716Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.1961996Z 2025-05-07T20:33:00.1962086Z @given( 2025-05-07T20:33:00.1962320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1962648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1962969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1963306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1963726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1964033Z ) 2025-05-07T20:33:00.1964410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1964870Z def test_silu_mul_quant( 2025-05-07T20:33:00.1965127Z self, 2025-05-07T20:33:00.1965334Z T: int, 2025-05-07T20:33:00.1965534Z D: int, 2025-05-07T20:33:00.1965761Z scale_ub: Optional[float], 2025-05-07T20:33:00.1966043Z contiguous: bool, 2025-05-07T20:33:00.1966289Z compiled: bool, 2025-05-07T20:33:00.1966523Z ) -> None: 2025-05-07T20:33:00.1966820Z torch.manual_seed(2025) 2025-05-07T20:33:00.1967067Z 2025-05-07T20:33:00.1967354Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1969542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1971483Z 2025-05-07T20:33:00.1971606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1971824Z 2025-05-07T20:33:00.1971939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1972366Z self=, 2025-05-07T20:33:00.1972787Z T=16384, 2025-05-07T20:33:00.1972993Z D=7168, 2025-05-07T20:33:00.1973185Z scale_ub=None, 2025-05-07T20:33:00.1973412Z contiguous=True, 2025-05-07T20:33:00.1973651Z compiled=False, 2025-05-07T20:33:00.1973858Z ) 2025-05-07T20:33:00.1974195Z self = 2025-05-07T20:33:00.1974710Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.1975070Z 2025-05-07T20:33:00.1975161Z @given( 2025-05-07T20:33:00.1975397Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1975722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1976041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1976377Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1976719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1977020Z ) 2025-05-07T20:33:00.1977373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1977829Z def test_silu_mul_quant( 2025-05-07T20:33:00.1978154Z self, 2025-05-07T20:33:00.1978354Z T: int, 2025-05-07T20:33:00.1978558Z D: int, 2025-05-07T20:33:00.1978788Z scale_ub: Optional[float], 2025-05-07T20:33:00.1979069Z contiguous: bool, 2025-05-07T20:33:00.1979321Z compiled: bool, 2025-05-07T20:33:00.1979558Z ) -> None: 2025-05-07T20:33:00.1979787Z torch.manual_seed(2025) 2025-05-07T20:33:00.1980032Z 2025-05-07T20:33:00.1980318Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1982444Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1984419Z 2025-05-07T20:33:00.1984606Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.1984826Z 2025-05-07T20:33:00.1984932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1985363Z self=, 2025-05-07T20:33:00.1985784Z T=16384, 2025-05-07T20:33:00.1985988Z D=7168, 2025-05-07T20:33:00.1986180Z scale_ub=1200.0, 2025-05-07T20:33:00.1986416Z contiguous=True, 2025-05-07T20:33:00.1986650Z compiled=False, 2025-05-07T20:33:00.1986856Z ) 2025-05-07T20:33:00.1987190Z self = 2025-05-07T20:33:00.1987754Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.1988037Z 2025-05-07T20:33:00.1988117Z @given( 2025-05-07T20:33:00.1988361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1988689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1989000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1989346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1989690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1989988Z ) 2025-05-07T20:33:00.1990343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1990846Z def test_silu_mul_quant( 2025-05-07T20:33:00.1991099Z self, 2025-05-07T20:33:00.1991297Z T: int, 2025-05-07T20:33:00.1991503Z D: int, 2025-05-07T20:33:00.1991733Z scale_ub: Optional[float], 2025-05-07T20:33:00.1992010Z contiguous: bool, 2025-05-07T20:33:00.1992269Z compiled: bool, 2025-05-07T20:33:00.1992506Z ) -> None: 2025-05-07T20:33:00.1992725Z torch.manual_seed(2025) 2025-05-07T20:33:00.1992979Z 2025-05-07T20:33:00.1993266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1995446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
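The reported allocation sizes match the input tensor the test builds at line 92: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick arithmetic check (a standalone sketch, not part of the test file):

```python
# Sketch: the "Tried to allocate" sizes are exactly the bf16 input tensor
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per element.
def x_size_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / (1024 ** 2)

assert x_size_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"
assert x_size_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
assert x_size_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
```

So the per-example allocations are modest; the failures come from the 21.73 GiB that is already held when these examples run.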
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example is small enough to allocate, so the test runs past line 92, builds x0, x1, and scale_ub_tensor, and fails when the Triton kernel behind silu_mul_quant is compiled:

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
            module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
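This ValueError is raised by Triton, not by the test: fp8e4nv is Triton's name for the e4m3 FP8 format, and the GPU running this job supports only the 'fp8e4b15' and 'fp8e5' encodings. On such hardware the usual remedy is to gate fp8 tests on compute capability. A minimal sketch, assuming the common cutoff of SM 8.9 (Ada/Hopper) for e4m3 support; neither the helper nor the cutoff comes from the FBGEMM sources:

```python
# Sketch: skip fp8 tests on GPUs whose architecture lacks fp8e4nv (e4m3).
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage: apply to the test class or to individual fp8 tests.
skip_unless_fp8 = unittest.skipUnless(
    device_supports_fp8e4nv(), "fp8e4nv (e4m3) not supported on this GPU"
)
```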
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: Tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

With compiled=True the call is routed through torch.compile (torch/_dynamo/eval_frame.py:678: in _fn) before reaching the same Triton kernel, and it fails the same way:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three examples run out of memory again, and by now only 4.44 MiB of GPU 0's 22.07 GiB is free (this process has 22.06 GiB in use, 21.77 GiB of it allocated by PyTorch):

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): Tried to allocate 20.00 MiB
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95: Tried to allocate 20.00 MiB
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92: Tried to allocate 20.00 MiB
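Note the trend: the first round of failures reported 26.44 MiB free, while the examples above report 4.44 MiB free, so memory consumed by earlier examples is evidently not being released between Hypothesis examples. A defensive pattern for tests like this (a sketch of a generic mitigation, not FBGEMM's actual fix) is to drop dead references and return cached blocks between runs:

```python
# Sketch: release cached CUDA memory between test runs.
import gc
import unittest

import torch

class ActivationTestsWithCleanup(unittest.TestCase):
    def tearDown(self) -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.synchronize()  # wait for in-flight kernels
        torch.cuda.empty_cache()  # return cached allocator blocks
```

One caveat: tearDown runs once per test method, while Hypothesis runs many examples inside a single method, so per-example cleanup would have to happen in the test body itself.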
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
+ Exception Group Traceback (most recent call last):
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
  |   yield
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 591, in run
  |   self._callTestMethod(testMethod)
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  |   method()
  | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |   T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |   raise the_error_hypothesis_found
  | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): CUDA out of memory. Tried to allocate 40.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92: CUDA out of memory. Tried to allocate 20.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | torch.OutOfMemoryError at moe/activation_test.py:92: CUDA out of memory. Tried to allocate 20.00 MiB; 4.44 MiB free.
    | Falsifying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | triton.compiler.errors.CompilationError, raised from the reference path: moe/activation_test.py:126 (y_fp8_ref, y_scale_ref = ref_fn()) -> moe/activation_test.py:124 (return triton_quantize_fp8_row(y, scale_ub_tensor)) -> fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (_kernel_quantize_fp8_row[grid]) -> Triton autotuner -> jit -> compile:
    |   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
The captured Hypothesis output replays the verbose "Trying example" listings; the test source printed with each one is identical, so only the failing call and error differ.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

fn() completes here; the failure is in the reference path:

> y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(through triton/runtime/autotuner.py:186/166, triton/testing.py:117 do_bench, triton/runtime/jit.py:623 run, triton/compiler/compiler.py:273 compile)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
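Sub-exception 4 is informative: the fused silu_mul_quant kernel is not even needed to hit the error, because the test's own reference path (ref_fn, shown in the listings above) dies in the same fp8 cast inside _kernel_quantize_fp8_row. The reference math itself is plain PyTorch: SiLU(x0) * x1 in fp32, then rowwise FP8 quantization. A sketch of an unfused equivalent, under the assumption that rowwise quantization maps each row's absmax to the fp8 maximum (FBGEMM's exact scaling rules may differ):

```python
# Sketch: unfused reference for the silu-mul-quant pattern in plain PyTorch.
from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Same math as ref_fn in the test: x0 * sigmoid(x0) * x1, in fp32.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Assumed rowwise scheme: scale each row so its absmax hits the fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(-1)
```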
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
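Each distinct failure above came with a @reproduce_failure blob; applying one pins Hypothesis to replay exactly that example, which is the fastest way to debug these locally. A sketch for failure 1 (decorator and blob copied from the log; the body is abbreviated, and the blob is only valid against Hypothesis 6.131.14):

```python
# Sketch: deterministically replay Hypothesis failure 1 from this log.
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed above
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body
```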
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

Here fn() completes and the reference path fails instead:

> y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
2025-05-07T20:33:00.6231669Z op = torch.compile(op)
2025-05-07T20:33:00.6231972Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:00.6232253Z
2025-05-07T20:33:00.6232449Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6232614Z
2025-05-07T20:33:00.6232717Z moe/activation_test.py:117:
2025-05-07T20:33:00.6246490Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6246843Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6247104Z E ^
2025-05-07T20:33:00.6247578Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6248036Z
2025-05-07T20:33:00.6248461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6249024Z
2025-05-07T20:33:00.6249131Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6249554Z self=,
2025-05-07T20:33:00.6249958Z T=1,
2025-05-07T20:33:00.6250137Z D=7168,
2025-05-07T20:33:00.6250334Z scale_ub=None,
2025-05-07T20:33:00.6250554Z contiguous=True,
2025-05-07T20:33:00.6250782Z compiled=True,
2025-05-07T20:33:00.6250993Z )
2025-05-07T20:33:00.6251317Z self =
2025-05-07T20:33:00.6251808Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
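Every failing example above reduces to the same root cause: Triton's fp8e4nv is CUDA's E4M3 format, which Triton only compiles for GPUs of compute capability 8.9 or newer (e.g. L4, H100), while the A10G in a linux.g5.4xlarge runner reports capability 8.6 and therefore only offers fp8e4b15 and fp8e5. Both kernels fail inside make_ir before anything launches. A minimal capability guard along these lines would skip the test rather than fail it; supports_fp8e4nv and requires_fp8e4nv are illustrative names, not part of moe/activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (CUDA E4M3) needs SM 8.9+;
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative marker, not defined in the FBGEMM test file:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+"
    )

Applied as a decorator on test_silu_mul_quant, this would report the whole Hypothesis run as skipped on this hardware instead of erroring on every drawn example.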
2025-05-07T20:33:00.6266785Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6266982Z
2025-05-07T20:33:00.6267088Z moe/activation_test.py:126:
2025-05-07T20:33:00.6287501Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6287864Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6288135Z E ^
2025-05-07T20:33:00.6288597Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6289101Z
2025-05-07T20:33:00.6289526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6290048Z
2025-05-07T20:33:00.6290152Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6290572Z self=,
2025-05-07T20:33:00.6290969Z T=4096,
2025-05-07T20:33:00.6291164Z D=5120,
2025-05-07T20:33:00.6291364Z scale_ub=None,
2025-05-07T20:33:00.6291578Z contiguous=False,
2025-05-07T20:33:00.6291809Z compiled=False,
2025-05-07T20:33:00.6292016Z )
2025-05-07T20:33:00.6300207Z self =
2025-05-07T20:33:00.6300762Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6312696Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6312863Z
2025-05-07T20:33:00.6312972Z moe/activation_test.py:117:
2025-05-07T20:33:00.6322924Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6323026Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6323106Z E ^
2025-05-07T20:33:00.6323469Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6323474Z
2025-05-07T20:33:00.6323896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6323941Z
2025-05-07T20:33:00.6324053Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6324279Z self=,
2025-05-07T20:33:00.6324361Z T=4096,
2025-05-07T20:33:00.6324437Z D=7168,
2025-05-07T20:33:00.6324520Z scale_ub=None,
2025-05-07T20:33:00.6324616Z contiguous=False,
2025-05-07T20:33:00.6324700Z compiled=False,
2025-05-07T20:33:00.6324774Z )
2025-05-07T20:33:00.6325037Z self =
2025-05-07T20:33:00.6325268Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6329632Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6329637Z
2025-05-07T20:33:00.6329733Z moe/activation_test.py:117:
2025-05-07T20:33:00.6335879Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6335990Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6336064Z E ^
2025-05-07T20:33:00.6336427Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6336437Z
2025-05-07T20:33:00.6336860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6336865Z
2025-05-07T20:33:00.6336968Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6337195Z self=,
2025-05-07T20:33:00.6337273Z T=128,
2025-05-07T20:33:00.6337348Z D=7168,
2025-05-07T20:33:00.6337436Z scale_ub=None,
2025-05-07T20:33:00.6337521Z contiguous=False,
2025-05-07T20:33:00.6337603Z compiled=True,
2025-05-07T20:33:00.6337678Z )
2025-05-07T20:33:00.6337898Z self =
2025-05-07T20:33:00.6338162Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
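For context, the ref_fn that fails to compile here is just SiLU(x0) * x1 followed by row-wise fp8 quantization. A hedged eager-PyTorch sketch of that computation, assuming E4M3's maximum representable value of 448 and PyTorch's torch.float8_e4m3fn dtype (the helper name is illustrative, and the exact epsilon and clamping details inside triton_quantize_fp8_row may differ):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn above
        x0f, x1f = x0.float(), x1.float()
        y = x0f * torch.sigmoid(x0f) * x1f
        # Row-wise scale: max |y| per row, optionally clamped to scale_ub
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        scale = row_max.clamp(min=1e-12) / 448.0  # 448 = E4M3 max (assumed)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then recovers y the same way the test does: y = y_fp8.to(torch.float32) * y_scale[:, None].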
2025-05-07T20:33:00.6343642Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6343647Z
2025-05-07T20:33:00.6343744Z moe/activation_test.py:126:
2025-05-07T20:33:00.6352699Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6352802Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6352881Z E ^
2025-05-07T20:33:00.6353242Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6353246Z
2025-05-07T20:33:00.6353674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6353679Z
2025-05-07T20:33:00.6353784Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6354011Z self=,
2025-05-07T20:33:00.6354090Z T=128,
2025-05-07T20:33:00.6354165Z D=7168,
2025-05-07T20:33:00.6354244Z scale_ub=None,
2025-05-07T20:33:00.6354334Z contiguous=False,
2025-05-07T20:33:00.6354417Z compiled=False,
2025-05-07T20:33:00.6354486Z )
2025-05-07T20:33:00.6354712Z self =
2025-05-07T20:33:00.6354910Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:00.6359792Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6359797Z
2025-05-07T20:33:00.6359893Z moe/activation_test.py:117:
2025-05-07T20:33:00.6365953Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6366054Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6366131Z E ^
2025-05-07T20:33:00.6366496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6366542Z
2025-05-07T20:33:00.6366969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6366974Z
2025-05-07T20:33:00.6367085Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6367311Z self=,
2025-05-07T20:33:00.6367389Z T=4096,
2025-05-07T20:33:00.6367473Z D=5120,
2025-05-07T20:33:00.6367557Z scale_ub=1200.0,
2025-05-07T20:33:00.6367642Z contiguous=True,
2025-05-07T20:33:00.6367727Z compiled=False,
2025-05-07T20:33:00.6367801Z )
2025-05-07T20:33:00.6368034Z self =
2025-05-07T20:33:00.6368211Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:00.6372700Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6372704Z
2025-05-07T20:33:00.6372802Z moe/activation_test.py:117:
2025-05-07T20:33:00.6378890Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6378995Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6379071Z E ^
2025-05-07T20:33:00.6379430Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6379437Z
2025-05-07T20:33:00.6379865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6379870Z
2025-05-07T20:33:00.6379974Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.6380206Z self=,
2025-05-07T20:33:00.6380327Z T=1,
2025-05-07T20:33:00.6380404Z D=5120,
2025-05-07T20:33:00.6380493Z scale_ub=None,
2025-05-07T20:33:00.6380579Z contiguous=True,
2025-05-07T20:33:00.6380664Z compiled=True,
2025-05-07T20:33:00.6380741Z )
2025-05-07T20:33:00.6380964Z self =
2025-05-07T20:33:00.6381126Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
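The failure is independent of T, D, scale_ub, contiguous, and compiled: the @given grid has 5 x 2 x 2 x 2 x 2 = 80 possible combinations, and every sampled one hits the same compile-time check in make_ir. A standalone repro sketch (an assumed script, not part of the suite; the Triton dtype-mapping details are as I understand them) that should trigger the identical ValueError on a pre-SM-8.9 GPU with a trivial kernel:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(X, Y, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(X + offs)
        # The fp8e4nv cast is what trips the architecture check at compile time
        tl.store(Y + offs, x.to(tl.float8e4nv))

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # On an A10G (SM 8.6) this raises CompilationError wrapping the same
    # ValueError; on SM 8.9+ it compiles and runs.
    _cast_fp8e4nv[(1,)](x, y, BLOCK=128)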
y_scale_ref = ref_fn() 2025-05-07T20:33:00.6386633Z 2025-05-07T20:33:00.6386737Z moe/activation_test.py:126: 2025-05-07T20:33:00.6386869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6386984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6387121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6387738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6387851Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6388218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6388440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6388822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6389118Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6389528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6389788Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6390172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6390349Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6390737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6390822Z fn() 2025-05-07T20:33:00.6391232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6391313Z self.fn.run( 2025-05-07T20:33:00.6391665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6391758Z kernel = self.compile( 2025-05-07T20:33:00.6392143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6392329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6392454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6392499Z 2025-05-07T20:33:00.6392712Z self = 2025-05-07T20:33:00.6393507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6394022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6dba2f250>} 2025-05-07T20:33:00.6394789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6394988Z context = 2025-05-07T20:33:00.6394993Z 2025-05-07T20:33:00.6395193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6395487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6395599Z module_map=module_map) 2025-05-07T20:33:00.6395763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6395867Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.6395947Z E ^ 2025-05-07T20:33:00.6396307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6396312Z 2025-05-07T20:33:00.6396734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6396744Z 2025-05-07T20:33:00.6396847Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6397113Z self=, 2025-05-07T20:33:00.6397194Z T=2048, 2025-05-07T20:33:00.6397272Z D=5120, 2025-05-07T20:33:00.6397355Z scale_ub=None, 2025-05-07T20:33:00.6397445Z contiguous=True, 2025-05-07T20:33:00.6397534Z compiled=True, 2025-05-07T20:33:00.6397606Z ) 2025-05-07T20:33:00.6397833Z self = 2025-05-07T20:33:00.6398002Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.6398007Z 2025-05-07T20:33:00.6398125Z @given( 2025-05-07T20:33:00.6398252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6398351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6398471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6398589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6398705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6398783Z ) 2025-05-07T20:33:00.6399031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6399127Z def test_silu_mul_quant( 2025-05-07T20:33:00.6399205Z self, 2025-05-07T20:33:00.6399321Z T: int, 2025-05-07T20:33:00.6399399Z D: int, 2025-05-07T20:33:00.6399501Z scale_ub: Optional[float], 2025-05-07T20:33:00.6399589Z contiguous: bool, 2025-05-07T20:33:00.6399678Z compiled: bool, 2025-05-07T20:33:00.6399756Z ) -> None: 2025-05-07T20:33:00.6399852Z torch.manual_seed(2025) 2025-05-07T20:33:00.6399929Z 2025-05-07T20:33:00.6400101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6400175Z 2025-05-07T20:33:00.6400271Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6400400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6400489Z x = x_sign * x_clamp 2025-05-07T20:33:00.6400577Z x0 = x[:, :D] 2025-05-07T20:33:00.6400661Z x1 = x[:, D:] 2025-05-07T20:33:00.6400732Z 2025-05-07T20:33:00.6400817Z if contiguous: 2025-05-07T20:33:00.6400975Z x0 = x0.contiguous() 2025-05-07T20:33:00.6401068Z x1 = x1.contiguous() 2025-05-07T20:33:00.6401145Z 2025-05-07T20:33:00.6401235Z if scale_ub is not None: 2025-05-07T20:33:00.6401346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6401479Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6401553Z ) 2025-05-07T20:33:00.6401636Z else: 2025-05-07T20:33:00.6401729Z scale_ub_tensor = None 2025-05-07T20:33:00.6401801Z 2025-05-07T20:33:00.6401936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6402023Z op = silu_mul_quant 2025-05-07T20:33:00.6402107Z if compiled: 
2025-05-07T20:33:00.6402211Z op = torch.compile(op) 2025-05-07T20:33:00.6402317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6402389Z 2025-05-07T20:33:00.6402488Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.6402611Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.6402692Z 2025-05-07T20:33:00.6402831Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6402936Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.6403037Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.6403164Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.6403305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6403383Z 2025-05-07T20:33:00.6403484Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.6403488Z 2025-05-07T20:33:00.6403587Z moe/activation_test.py:126: 2025-05-07T20:33:00.6403716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6403866Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6404005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6404577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6404683Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6405052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6405276Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6405698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6405955Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6406358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6406620Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6406999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6407209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6407564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6407641Z fn() 2025-05-07T20:33:00.6408054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6408141Z self.fn.run( 2025-05-07T20:33:00.6408485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6408583Z kernel = self.compile( 2025-05-07T20:33:00.6408974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6409151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6409323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6409327Z 2025-05-07T20:33:00.6409540Z self = 2025-05-07T20:33:00.6410338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
2025-05-07T20:33:00.6410854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6db5d3760>}
2025-05-07T20:33:00.6411661Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.6411933Z context = 
2025-05-07T20:33:00.6412173Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.6412455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.6412569Z module_map=module_map)
2025-05-07T20:33:00.6412743Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6412852Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6412932Z E ^
2025-05-07T20:33:00.6413299Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6413782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6413903Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6420913Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6421022Z moe/activation_test.py:126:
2025-05-07T20:33:00.6439741Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6439850Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6439935Z E ^
2025-05-07T20:33:00.6440301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6440779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6440898Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6447658Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6447816Z moe/activation_test.py:126:
2025-05-07T20:33:00.6482718Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6482819Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6482892Z E ^
2025-05-07T20:33:00.6483322Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6483749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6483861Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6490434Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6490536Z moe/activation_test.py:126:
2025-05-07T20:33:00.6499727Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6499831Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6499968Z E ^
2025-05-07T20:33:00.6500333Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6500772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6500883Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:00.6506602Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6506708Z moe/activation_test.py:117:
2025-05-07T20:33:00.6513307Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6513415Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6513492Z E ^
2025-05-07T20:33:00.6513853Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6514289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6514399Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:00.6521175Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:00.6521286Z moe/activation_test.py:126:
2025-05-07T20:33:00.6530243Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6530348Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.6530435Z E ^
2025-05-07T20:33:00.6530790Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6531213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6531335Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:00.6537050Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6537153Z moe/activation_test.py:117:
2025-05-07T20:33:00.6543299Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6543400Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6543478Z E ^
2025-05-07T20:33:00.6543847Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6544313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6544433Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:00.6550134Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6550279Z moe/activation_test.py:117:
2025-05-07T20:33:00.6557050Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6557151Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6557237Z E ^
2025-05-07T20:33:00.6557596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6558025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6558144Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6563922Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6564024Z moe/activation_test.py:117:
2025-05-07T20:33:00.6570077Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6570182Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6570261Z E ^
2025-05-07T20:33:00.6570628Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6571053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6571205Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:00.6576882Z > y_fp8, y_scale = fn()
2025-05-07T20:33:00.6576989Z moe/activation_test.py:117:
2025-05-07T20:33:00.6583023Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.6583122Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:00.6583207Z E ^
2025-05-07T20:33:00.6583568Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6583998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
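Every example above fails at Triton compile time, before any numerics run: Triton only lowers the fp8e4nv (FP8 e4m3) dtype on GPUs of compute capability 8.9 (Ada) or newer, and on older architectures it offers only 'fp8e4b15' and 'fp8e5', exactly as the error text says. A minimal sketch, outside the log and assuming only torch and the standard library, of the kind of capability guard that would skip these tests on such GPUs (the helper name and skip message are illustrative, not FBGEMM's actual API):

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) lowering needs sm_89 (Ada) or newer;
    # earlier GPUs raise the ValueError seen throughout this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTest(unittest.TestCase):
    ...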
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6583573Z
2025-05-07T20:33:00.6583998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
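Every example Hypothesis generates for test_silu_mul_quant fails at Triton compile time, before the kernel ever runs: _fbgemm_silu_mul_quant stores its result as fp8e4nv (torch.float8_e4m3fn), and Triton lowers that dtype only on NVIDIA GPUs with compute capability 8.9 or newer, while the GPU on this runner exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. Since the test runs with @settings(verbosity=Verbosity.verbose), Hypothesis logs each generated example, and every example below hits the same CompilationError. A minimal sketch of a capability guard that would skip, rather than fail, these tests on unsupported GPUs; supports_fp8e4nv is a hypothetical helper, not part of moe/activation_test.py:

    import unittest

    import torch

    # Hypothetical guard (assumption: Triton lowers fp8e4nv only on GPUs
    # with compute capability >= (8, 9), e.g. Ada or Hopper; older parts
    # such as sm_80/sm_86 expose only fp8e4b15 and fp8e5).
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test method, this would skip instead of failing every
    # generated example:
    #
    #     @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    #     def test_silu_mul_quant(self, ...) -> None: ...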
2025-05-07T20:33:00.6584107Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:00.6600638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6600736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6600810Z E ^ 2025-05-07T20:33:00.6601171Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6601598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6601710Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:00.6613754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6613854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6613930Z E ^ 2025-05-07T20:33:00.6614297Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6614763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6614875Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6627012Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6627110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6627189Z E ^ 2025-05-07T20:33:00.6627549Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6627554Z 2025-05-07T20:33:00.6627973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6627978Z 2025-05-07T20:33:00.6628132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6628358Z self=, 2025-05-07T20:33:00.6628434Z T=1, 2025-05-07T20:33:00.6628509Z D=7168, 2025-05-07T20:33:00.6628587Z scale_ub=None, 2025-05-07T20:33:00.6628677Z contiguous=False, 2025-05-07T20:33:00.6628756Z compiled=True, 2025-05-07T20:33:00.6628824Z ) 2025-05-07T20:33:00.6629044Z self = 2025-05-07T20:33:00.6629209Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6629214Z 2025-05-07T20:33:00.6629328Z @given( 2025-05-07T20:33:00.6629452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6629547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6629663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6629780Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6629894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6629969Z ) 2025-05-07T20:33:00.6630214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6630306Z def test_silu_mul_quant( 2025-05-07T20:33:00.6630382Z self, 2025-05-07T20:33:00.6630493Z T: int, 2025-05-07T20:33:00.6630566Z D: int, 2025-05-07T20:33:00.6630667Z scale_ub: Optional[float], 2025-05-07T20:33:00.6630755Z contiguous: bool, 2025-05-07T20:33:00.6630837Z compiled: bool, 2025-05-07T20:33:00.6630913Z ) -> None: 2025-05-07T20:33:00.6631004Z torch.manual_seed(2025) 2025-05-07T20:33:00.6631083Z 2025-05-07T20:33:00.6631251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6631322Z 2025-05-07T20:33:00.6631415Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6631538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6631623Z x = x_sign * x_clamp 2025-05-07T20:33:00.6631705Z x0 = x[:, :D] 2025-05-07T20:33:00.6631782Z x1 = x[:, D:] 2025-05-07T20:33:00.6631850Z 2025-05-07T20:33:00.6631934Z if contiguous: 2025-05-07T20:33:00.6632067Z x0 = x0.contiguous() 2025-05-07T20:33:00.6632155Z x1 = x1.contiguous() 2025-05-07T20:33:00.6632228Z 2025-05-07T20:33:00.6632315Z if scale_ub is not None: 2025-05-07T20:33:00.6632418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6632555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6632628Z ) 2025-05-07T20:33:00.6632705Z else: 2025-05-07T20:33:00.6632798Z scale_ub_tensor = None 2025-05-07T20:33:00.6632866Z 2025-05-07T20:33:00.6632998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6633087Z op = silu_mul_quant 2025-05-07T20:33:00.6633167Z if compiled: 2025-05-07T20:33:00.6633265Z op = torch.compile(op) 2025-05-07T20:33:00.6633373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6633441Z 2025-05-07T20:33:00.6633531Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.6633653Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.6633722Z 2025-05-07T20:33:00.6633861Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6633960Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.6634063Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.6634183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.6634323Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6634395Z 2025-05-07T20:33:00.6634491Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.6634496Z 2025-05-07T20:33:00.6634593Z moe/activation_test.py:126: 2025-05-07T20:33:00.6634721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6634869Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.6635007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.6635626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.6635729Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.6636094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6636315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6636728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.6636986Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6637389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.6637644Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.6638020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.6638226Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.6638576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.6638651Z fn() 2025-05-07T20:33:00.6639054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.6639136Z self.fn.run( 2025-05-07T20:33:00.6639476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6639570Z kernel = self.compile( 2025-05-07T20:33:00.6639952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6640127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6640295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6640302Z 2025-05-07T20:33:00.6640507Z self = 2025-05-07T20:33:00.6641295Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6641805Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fd6da3aaef0>} 2025-05-07T20:33:00.6642566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6642759Z context = 2025-05-07T20:33:00.6642765Z 2025-05-07T20:33:00.6642933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6643201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6643306Z module_map=module_map) 2025-05-07T20:33:00.6643466Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6643572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.6643647Z E ^ 2025-05-07T20:33:00.6644008Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6644013Z 2025-05-07T20:33:00.6645003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
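Unlike the neighboring examples, this one fails in the test's reference path rather than in the op under test: ref_fn calls triton_quantize_fp8_row, whose autotuner benchmarks and compiles _kernel_quantize_fp8_row, and that kernel also materializes fp8e4nv output, so the same architecture check fires from inside do_bench. For orientation, a rough pure-PyTorch sketch of the rowwise fp8 quantization being exercised here; the scale semantics (per-row scale = row_max / fp8_max, optionally capped by scale_ub) are an assumption chosen to match the test's dequantization step y = y_fp8.to(torch.float32) * y_scale[:, None], not a transcription of FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    # Illustrative only; FBGEMM's _kernel_quantize_fp8_row may differ in
    # details such as eps handling and saturation behavior.
    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale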
2025-05-07T20:33:00.6645115Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6657482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6657679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6657756Z E ^ 2025-05-07T20:33:00.6658174Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6658599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6658707Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6670243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6670339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6670412Z E ^ 2025-05-07T20:33:00.6670818Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6671239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6671345Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6683269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6683367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6683445Z E ^ 2025-05-07T20:33:00.6683802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6684229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6684338Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:00.6696331Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6696429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6696515Z E ^ 2025-05-07T20:33:00.6696873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6697302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6697416Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:00.6709118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6709217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6709293Z E ^ 2025-05-07T20:33:00.6709698Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6710131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6710246Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:00.6725991Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6726093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6726168Z E ^ 2025-05-07T20:33:00.6726527Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.6726961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.6727072Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:00.6739141Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6739241Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6739313Z E ^ 2025-05-07T20:33:00.6739674Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6739679Z 2025-05-07T20:33:00.6740107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6740114Z 2025-05-07T20:33:00.6740216Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6740483Z self=, 2025-05-07T20:33:00.6740559Z T=4096, 2025-05-07T20:33:00.6740631Z D=5120, 2025-05-07T20:33:00.6740714Z scale_ub=None, 2025-05-07T20:33:00.6740797Z contiguous=False, 2025-05-07T20:33:00.6740875Z compiled=True, 2025-05-07T20:33:00.6740950Z ) 2025-05-07T20:33:00.6741168Z self = 2025-05-07T20:33:00.6741341Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6741346Z 2025-05-07T20:33:00.6741421Z @given( 2025-05-07T20:33:00.6741535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6741632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6741747Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6741905Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6742017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6742091Z ) 2025-05-07T20:33:00.6742337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6742430Z def test_silu_mul_quant( 2025-05-07T20:33:00.6742503Z self, 2025-05-07T20:33:00.6742575Z T: int, 2025-05-07T20:33:00.6742652Z D: int, 2025-05-07T20:33:00.6742747Z scale_ub: Optional[float], 2025-05-07T20:33:00.6742832Z contiguous: bool, 2025-05-07T20:33:00.6742916Z compiled: bool, 2025-05-07T20:33:00.6742990Z ) -> None: 2025-05-07T20:33:00.6743082Z torch.manual_seed(2025) 2025-05-07T20:33:00.6743155Z 2025-05-07T20:33:00.6743324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6743398Z 2025-05-07T20:33:00.6743486Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6743608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6743698Z x = x_sign * x_clamp 2025-05-07T20:33:00.6743777Z x0 = x[:, :D] 2025-05-07T20:33:00.6743853Z x1 = x[:, D:] 2025-05-07T20:33:00.6743924Z 2025-05-07T20:33:00.6744005Z if contiguous: 2025-05-07T20:33:00.6744092Z x0 = x0.contiguous() 2025-05-07T20:33:00.6744181Z x1 = x1.contiguous() 2025-05-07T20:33:00.6744250Z 2025-05-07T20:33:00.6744339Z if scale_ub is not None: 2025-05-07T20:33:00.6744443Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6744575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6744651Z ) 2025-05-07T20:33:00.6744723Z else: 2025-05-07T20:33:00.6744813Z scale_ub_tensor = None 2025-05-07T20:33:00.6744885Z 2025-05-07T20:33:00.6745117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6745220Z op = silu_mul_quant 2025-05-07T20:33:00.6745306Z if compiled: 2025-05-07T20:33:00.6745404Z op = torch.compile(op) 2025-05-07T20:33:00.6745510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6745584Z 2025-05-07T20:33:00.6745672Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6745676Z 2025-05-07T20:33:00.6745772Z moe/activation_test.py:117: 2025-05-07T20:33:00.6745900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6746040Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6746140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6746515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6746605Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6747113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6747210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6747611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6747839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6748188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6748286Z kernel = self.compile( 2025-05-07T20:33:00.6748677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6748853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6748979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6748983Z 2025-05-07T20:33:00.6749192Z self = 2025-05-07T20:33:00.6749989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6750539Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4280>} 2025-05-07T20:33:00.6751304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6751502Z context = 2025-05-07T20:33:00.6751506Z 2025-05-07T20:33:00.6751672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6751946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6752051Z module_map=module_map) 2025-05-07T20:33:00.6752214Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6752318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6752392Z E ^ 2025-05-07T20:33:00.6752752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6752757Z 2025-05-07T20:33:00.6753177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6753184Z 2025-05-07T20:33:00.6753286Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6753514Z self=, 2025-05-07T20:33:00.6753590Z T=4096, 2025-05-07T20:33:00.6753666Z D=5120, 2025-05-07T20:33:00.6753790Z scale_ub=1200.0, 2025-05-07T20:33:00.6753878Z contiguous=False, 2025-05-07T20:33:00.6753964Z compiled=False, 2025-05-07T20:33:00.6754037Z ) 2025-05-07T20:33:00.6754256Z self = 2025-05-07T20:33:00.6754436Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6754441Z 2025-05-07T20:33:00.6754517Z @given( 2025-05-07T20:33:00.6754633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6754734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6754889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6755034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6755163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6755243Z ) 2025-05-07T20:33:00.6755493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6755781Z def test_silu_mul_quant( 2025-05-07T20:33:00.6755899Z self, 2025-05-07T20:33:00.6756017Z T: int, 2025-05-07T20:33:00.6756103Z D: int, 2025-05-07T20:33:00.6756204Z scale_ub: Optional[float], 2025-05-07T20:33:00.6756293Z contiguous: bool, 2025-05-07T20:33:00.6756467Z compiled: bool, 2025-05-07T20:33:00.6756545Z ) -> None: 2025-05-07T20:33:00.6756641Z torch.manual_seed(2025) 2025-05-07T20:33:00.6756711Z 2025-05-07T20:33:00.6756881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6756956Z 2025-05-07T20:33:00.6757051Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6757177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6757262Z x = x_sign * x_clamp 2025-05-07T20:33:00.6757341Z x0 = x[:, :D] 2025-05-07T20:33:00.6757421Z x1 = x[:, D:] 2025-05-07T20:33:00.6757491Z 2025-05-07T20:33:00.6757572Z if contiguous: 2025-05-07T20:33:00.6757668Z x0 = x0.contiguous() 2025-05-07T20:33:00.6757755Z x1 = x1.contiguous() 2025-05-07T20:33:00.6757827Z 2025-05-07T20:33:00.6757988Z if scale_ub is not None: 2025-05-07T20:33:00.6758093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6758228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6758305Z ) 2025-05-07T20:33:00.6758380Z else: 2025-05-07T20:33:00.6758474Z scale_ub_tensor = None 2025-05-07T20:33:00.6758543Z 2025-05-07T20:33:00.6758674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6758765Z op = silu_mul_quant 2025-05-07T20:33:00.6758847Z if compiled: 2025-05-07T20:33:00.6758944Z op = torch.compile(op) 2025-05-07T20:33:00.6759053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6759123Z 2025-05-07T20:33:00.6759213Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6759217Z 2025-05-07T20:33:00.6759319Z moe/activation_test.py:117: 2025-05-07T20:33:00.6759444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6759552Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6759650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6760160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6760261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6760623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6760848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6761195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6761287Z kernel = self.compile( 2025-05-07T20:33:00.6761748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6761927Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6762053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6762060Z 2025-05-07T20:33:00.6762269Z self = 2025-05-07T20:33:00.6763060Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6763631Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5000>} 2025-05-07T20:33:00.6764397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6764591Z context = 2025-05-07T20:33:00.6764603Z 2025-05-07T20:33:00.6764809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6765079Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6765187Z module_map=module_map) 2025-05-07T20:33:00.6765347Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6765449Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6765528Z E ^ 2025-05-07T20:33:00.6765885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6765890Z 2025-05-07T20:33:00.6766318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6766322Z 2025-05-07T20:33:00.6766427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6766695Z self=, 2025-05-07T20:33:00.6766774Z T=4096, 2025-05-07T20:33:00.6766851Z D=5120, 2025-05-07T20:33:00.6766933Z scale_ub=1200.0, 2025-05-07T20:33:00.6767024Z contiguous=False, 2025-05-07T20:33:00.6767108Z compiled=True, 2025-05-07T20:33:00.6767180Z ) 2025-05-07T20:33:00.6767403Z self = 2025-05-07T20:33:00.6767580Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6767585Z 2025-05-07T20:33:00.6767662Z @given( 2025-05-07T20:33:00.6767777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6767875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6767995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6768112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6768224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6768300Z ) 2025-05-07T20:33:00.6768548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6768639Z def test_silu_mul_quant( 2025-05-07T20:33:00.6768712Z self, 2025-05-07T20:33:00.6768784Z T: int, 2025-05-07T20:33:00.6768858Z D: int, 2025-05-07T20:33:00.6768954Z scale_ub: Optional[float], 2025-05-07T20:33:00.6769039Z contiguous: bool, 2025-05-07T20:33:00.6769124Z compiled: bool, 2025-05-07T20:33:00.6769199Z ) -> None: 2025-05-07T20:33:00.6769294Z torch.manual_seed(2025) 2025-05-07T20:33:00.6769367Z 2025-05-07T20:33:00.6769534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6769603Z 2025-05-07T20:33:00.6769694Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6769865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6769952Z x = x_sign * x_clamp 2025-05-07T20:33:00.6770032Z x0 = x[:, :D] 2025-05-07T20:33:00.6770107Z x1 = x[:, D:] 2025-05-07T20:33:00.6770178Z 2025-05-07T20:33:00.6770259Z if contiguous: 2025-05-07T20:33:00.6770347Z x0 = x0.contiguous() 2025-05-07T20:33:00.6770438Z x1 = x1.contiguous() 2025-05-07T20:33:00.6770505Z 2025-05-07T20:33:00.6770592Z if scale_ub is not None: 2025-05-07T20:33:00.6770695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6770868Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6770941Z ) 2025-05-07T20:33:00.6771015Z else: 2025-05-07T20:33:00.6771104Z scale_ub_tensor = None 2025-05-07T20:33:00.6771172Z 2025-05-07T20:33:00.6771303Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6771392Z op = silu_mul_quant 2025-05-07T20:33:00.6771472Z if compiled: 2025-05-07T20:33:00.6771572Z op = torch.compile(op) 2025-05-07T20:33:00.6771677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6771748Z 2025-05-07T20:33:00.6771875Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6771879Z 2025-05-07T20:33:00.6771976Z moe/activation_test.py:117: 2025-05-07T20:33:00.6772101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6772203Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6772303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6772679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6772768Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6773274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6773371Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6773737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6774027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6774375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6774470Z kernel = self.compile( 2025-05-07T20:33:00.6774885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6775085Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6775210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6775215Z 2025-05-07T20:33:00.6775420Z self = 2025-05-07T20:33:00.6776218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6776732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a4700>} 2025-05-07T20:33:00.6777494Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6777689Z context = 2025-05-07T20:33:00.6777693Z 2025-05-07T20:33:00.6777860Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6778232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6778340Z module_map=module_map) 2025-05-07T20:33:00.6778501Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6778607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6778683Z E ^ 2025-05-07T20:33:00.6779043Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6779048Z 2025-05-07T20:33:00.6779468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6779524Z 2025-05-07T20:33:00.6779630Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6779860Z self=, 2025-05-07T20:33:00.6779935Z T=2048, 2025-05-07T20:33:00.6780011Z D=7168, 2025-05-07T20:33:00.6780091Z scale_ub=1200.0, 2025-05-07T20:33:00.6780175Z contiguous=False, 2025-05-07T20:33:00.6780261Z compiled=False, 2025-05-07T20:33:00.6780332Z ) 2025-05-07T20:33:00.6780551Z self = 2025-05-07T20:33:00.6780730Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6780735Z 2025-05-07T20:33:00.6780853Z @given( 2025-05-07T20:33:00.6780971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6781072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6781185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6781303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6781417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6781490Z ) 2025-05-07T20:33:00.6781744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6781838Z def test_silu_mul_quant( 2025-05-07T20:33:00.6781913Z self, 2025-05-07T20:33:00.6781993Z T: int, 2025-05-07T20:33:00.6782071Z D: int, 2025-05-07T20:33:00.6782169Z scale_ub: Optional[float], 2025-05-07T20:33:00.6782259Z contiguous: bool, 2025-05-07T20:33:00.6782388Z compiled: bool, 2025-05-07T20:33:00.6782465Z ) -> None: 2025-05-07T20:33:00.6782566Z torch.manual_seed(2025) 2025-05-07T20:33:00.6782640Z 2025-05-07T20:33:00.6782813Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6782886Z 2025-05-07T20:33:00.6782977Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6783107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6783195Z x = x_sign * x_clamp 2025-05-07T20:33:00.6783273Z x0 = x[:, :D] 2025-05-07T20:33:00.6783352Z x1 = x[:, D:] 2025-05-07T20:33:00.6783423Z 2025-05-07T20:33:00.6783505Z if contiguous: 2025-05-07T20:33:00.6783597Z x0 = x0.contiguous() 2025-05-07T20:33:00.6783683Z x1 = x1.contiguous() 2025-05-07T20:33:00.6783755Z 2025-05-07T20:33:00.6783847Z if scale_ub is not None: 2025-05-07T20:33:00.6783950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6784089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6784168Z ) 2025-05-07T20:33:00.6784244Z else: 2025-05-07T20:33:00.6784338Z scale_ub_tensor = None 2025-05-07T20:33:00.6784409Z 2025-05-07T20:33:00.6784539Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6784629Z op = silu_mul_quant 2025-05-07T20:33:00.6784710Z if compiled: 2025-05-07T20:33:00.6784809Z op = torch.compile(op) 2025-05-07T20:33:00.6784929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6785010Z 2025-05-07T20:33:00.6785109Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6785114Z 2025-05-07T20:33:00.6785225Z moe/activation_test.py:117: 2025-05-07T20:33:00.6785395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6785499Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6785597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6786110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6786210Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6786573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6786795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6787187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6787279Z kernel = self.compile( 2025-05-07T20:33:00.6787671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6787849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6787972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6787979Z 2025-05-07T20:33:00.6788190Z self = 2025-05-07T20:33:00.6789020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6789539Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a5240>} 2025-05-07T20:33:00.6790307Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6790498Z context = 2025-05-07T20:33:00.6790505Z 2025-05-07T20:33:00.6790670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6790980Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6791089Z module_map=module_map) 2025-05-07T20:33:00.6791250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6791346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6791423Z E ^ 2025-05-07T20:33:00.6791784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6791789Z 2025-05-07T20:33:00.6792212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6792216Z 2025-05-07T20:33:00.6792320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6792543Z self=, 2025-05-07T20:33:00.6792624Z T=1, 2025-05-07T20:33:00.6792701Z D=7168, 2025-05-07T20:33:00.6792783Z scale_ub=None, 2025-05-07T20:33:00.6792870Z contiguous=True, 2025-05-07T20:33:00.6792952Z compiled=False, 2025-05-07T20:33:00.6793023Z ) 2025-05-07T20:33:00.6793245Z self = 2025-05-07T20:33:00.6793408Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.6793416Z 2025-05-07T20:33:00.6793493Z @given( 2025-05-07T20:33:00.6793609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6793706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6793825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6793942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6794094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6794170Z ) 2025-05-07T20:33:00.6794419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6794520Z def test_silu_mul_quant( 2025-05-07T20:33:00.6794594Z self, 2025-05-07T20:33:00.6794670Z T: int, 2025-05-07T20:33:00.6794749Z D: int, 2025-05-07T20:33:00.6794845Z scale_ub: Optional[float], 2025-05-07T20:33:00.6794932Z contiguous: bool, 2025-05-07T20:33:00.6795017Z compiled: bool, 2025-05-07T20:33:00.6795093Z ) -> None: 2025-05-07T20:33:00.6795227Z torch.manual_seed(2025) 2025-05-07T20:33:00.6795299Z 2025-05-07T20:33:00.6795468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6795542Z 2025-05-07T20:33:00.6795634Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6795758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6795847Z x = x_sign * x_clamp 2025-05-07T20:33:00.6795928Z x0 = x[:, :D] 2025-05-07T20:33:00.6796004Z x1 = x[:, D:] 2025-05-07T20:33:00.6796076Z 2025-05-07T20:33:00.6796158Z if contiguous: 2025-05-07T20:33:00.6796246Z x0 = x0.contiguous() 2025-05-07T20:33:00.6796374Z x1 = x1.contiguous() 2025-05-07T20:33:00.6796443Z 2025-05-07T20:33:00.6796531Z if scale_ub is not None: 2025-05-07T20:33:00.6796636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6796771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6796842Z ) 2025-05-07T20:33:00.6796919Z else: 2025-05-07T20:33:00.6797011Z scale_ub_tensor = None 2025-05-07T20:33:00.6797081Z 2025-05-07T20:33:00.6797211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6797298Z op = silu_mul_quant 2025-05-07T20:33:00.6797381Z if compiled: 2025-05-07T20:33:00.6797480Z op = torch.compile(op) 2025-05-07T20:33:00.6797584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6797655Z 2025-05-07T20:33:00.6797743Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6797791Z 2025-05-07T20:33:00.6797887Z moe/activation_test.py:117: 2025-05-07T20:33:00.6798022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6798125Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6798224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6798734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6798833Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6799198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6799422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6799770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6799869Z kernel = self.compile( 2025-05-07T20:33:00.6800260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6800437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6800565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6800569Z 2025-05-07T20:33:00.6800774Z self = 2025-05-07T20:33:00.6801572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6802127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a6050>} 2025-05-07T20:33:00.6802897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6803092Z context = 2025-05-07T20:33:00.6803097Z 2025-05-07T20:33:00.6803263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6803534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6803679Z module_map=module_map) 2025-05-07T20:33:00.6803846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6803944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6804018Z E ^ 2025-05-07T20:33:00.6804383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6804388Z 2025-05-07T20:33:00.6804808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6804815Z 2025-05-07T20:33:00.6804985Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6805211Z self=, 2025-05-07T20:33:00.6805289Z T=16384, 2025-05-07T20:33:00.6805365Z D=7168, 2025-05-07T20:33:00.6805446Z scale_ub=1200.0, 2025-05-07T20:33:00.6805535Z contiguous=False, 2025-05-07T20:33:00.6805622Z compiled=True, 2025-05-07T20:33:00.6805692Z ) 2025-05-07T20:33:00.6805911Z self = 2025-05-07T20:33:00.6806094Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6806098Z 2025-05-07T20:33:00.6806172Z @given( 2025-05-07T20:33:00.6806292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6806397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6806555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6806677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6806791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6806864Z ) 2025-05-07T20:33:00.6807118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6807208Z def test_silu_mul_quant( 2025-05-07T20:33:00.6807284Z self, 2025-05-07T20:33:00.6807358Z T: int, 2025-05-07T20:33:00.6807428Z D: int, 2025-05-07T20:33:00.6807525Z scale_ub: Optional[float], 2025-05-07T20:33:00.6807616Z contiguous: bool, 2025-05-07T20:33:00.6807698Z compiled: bool, 2025-05-07T20:33:00.6807774Z ) -> None: 2025-05-07T20:33:00.6807867Z torch.manual_seed(2025) 2025-05-07T20:33:00.6807937Z 2025-05-07T20:33:00.6808106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6808175Z 2025-05-07T20:33:00.6808266Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6808392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6808479Z x = x_sign * x_clamp 2025-05-07T20:33:00.6808555Z x0 = x[:, :D] 2025-05-07T20:33:00.6808633Z x1 = x[:, D:] 2025-05-07T20:33:00.6808703Z 2025-05-07T20:33:00.6808783Z if contiguous: 2025-05-07T20:33:00.6808874Z x0 = x0.contiguous() 2025-05-07T20:33:00.6808962Z x1 = x1.contiguous() 2025-05-07T20:33:00.6809030Z 2025-05-07T20:33:00.6809120Z if scale_ub is not None: 2025-05-07T20:33:00.6809221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6809357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6809430Z ) 2025-05-07T20:33:00.6809502Z else: 2025-05-07T20:33:00.6809640Z scale_ub_tensor = None 2025-05-07T20:33:00.6809710Z 2025-05-07T20:33:00.6809838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6809928Z op = silu_mul_quant 2025-05-07T20:33:00.6810009Z if compiled: 2025-05-07T20:33:00.6810107Z op = torch.compile(op) 2025-05-07T20:33:00.6810212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6810280Z 2025-05-07T20:33:00.6810368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6810375Z 2025-05-07T20:33:00.6810472Z moe/activation_test.py:117: 2025-05-07T20:33:00.6810636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6810736Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6810833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6811208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6811302Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6811806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6811903Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6812308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6812532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6812882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6812976Z kernel = self.compile( 2025-05-07T20:33:00.6813364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6813540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6813666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6813670Z 2025-05-07T20:33:00.6813875Z self = 2025-05-07T20:33:00.6814709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6815273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a7490>} 2025-05-07T20:33:00.6816041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6816232Z context = 2025-05-07T20:33:00.6816237Z 2025-05-07T20:33:00.6816407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6816674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6816781Z module_map=module_map) 2025-05-07T20:33:00.6816944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6817040Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6817115Z E ^ 2025-05-07T20:33:00.6817472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6817479Z 2025-05-07T20:33:00.6817900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6817905Z 2025-05-07T20:33:00.6818007Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6818304Z self=, 2025-05-07T20:33:00.6818425Z T=1, 2025-05-07T20:33:00.6818499Z D=7168, 2025-05-07T20:33:00.6818576Z scale_ub=None, 2025-05-07T20:33:00.6818663Z contiguous=False, 2025-05-07T20:33:00.6818746Z compiled=False, 2025-05-07T20:33:00.6818815Z ) 2025-05-07T20:33:00.6819040Z self = 2025-05-07T20:33:00.6819206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.6819211Z 2025-05-07T20:33:00.6819283Z @given( 2025-05-07T20:33:00.6819402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6819540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6819656Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6819773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6819883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6819959Z ) 2025-05-07T20:33:00.6820211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6820301Z def test_silu_mul_quant( 2025-05-07T20:33:00.6820379Z self, 2025-05-07T20:33:00.6820458Z T: int, 2025-05-07T20:33:00.6820529Z D: int, 2025-05-07T20:33:00.6820666Z scale_ub: Optional[float], 2025-05-07T20:33:00.6820752Z contiguous: bool, 2025-05-07T20:33:00.6820834Z compiled: bool, 2025-05-07T20:33:00.6820914Z ) -> None: 2025-05-07T20:33:00.6821007Z torch.manual_seed(2025) 2025-05-07T20:33:00.6821080Z 2025-05-07T20:33:00.6821250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6821324Z 2025-05-07T20:33:00.6821414Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6821537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6821623Z x = x_sign * x_clamp 2025-05-07T20:33:00.6821703Z x0 = x[:, :D] 2025-05-07T20:33:00.6821778Z x1 = x[:, D:] 2025-05-07T20:33:00.6821851Z 2025-05-07T20:33:00.6821935Z if contiguous: 2025-05-07T20:33:00.6822024Z x0 = x0.contiguous() 2025-05-07T20:33:00.6822110Z x1 = x1.contiguous() 2025-05-07T20:33:00.6822226Z 2025-05-07T20:33:00.6822313Z if scale_ub is not None: 2025-05-07T20:33:00.6822417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6822552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6822623Z ) 2025-05-07T20:33:00.6822698Z else: 2025-05-07T20:33:00.6822790Z scale_ub_tensor = None 2025-05-07T20:33:00.6822861Z 2025-05-07T20:33:00.6822995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6823080Z op = silu_mul_quant 2025-05-07T20:33:00.6823160Z if compiled: 2025-05-07T20:33:00.6823260Z op = torch.compile(op) 2025-05-07T20:33:00.6823361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6823428Z 2025-05-07T20:33:00.6823520Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6823525Z 2025-05-07T20:33:00.6823618Z moe/activation_test.py:117: 2025-05-07T20:33:00.6823744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6823845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6823944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6824454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6824547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6824911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6825150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6825532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6825672Z kernel = self.compile( 2025-05-07T20:33:00.6826060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6826236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6826362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6826367Z 2025-05-07T20:33:00.6826570Z self = 2025-05-07T20:33:00.6827361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6827910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd6da1a77f0>} 2025-05-07T20:33:00.6828674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6828908Z context = 2025-05-07T20:33:00.6828913Z 2025-05-07T20:33:00.6829079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6829351Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6829455Z module_map=module_map) 2025-05-07T20:33:00.6829622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6829721Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6829798Z E ^ 2025-05-07T20:33:00.6830155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6830162Z 2025-05-07T20:33:00.6830587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6830633Z 2025-05-07T20:33:00.6830736Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6830964Z self=, 2025-05-07T20:33:00.6831039Z T=2048, 2025-05-07T20:33:00.6831112Z D=7168, 2025-05-07T20:33:00.6831195Z scale_ub=None, 2025-05-07T20:33:00.6831279Z contiguous=False, 2025-05-07T20:33:00.6831359Z compiled=True, 2025-05-07T20:33:00.6831434Z ) 2025-05-07T20:33:00.6831655Z self = 2025-05-07T20:33:00.6831833Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6831837Z 2025-05-07T20:33:00.6831910Z @given( 2025-05-07T20:33:00.6832029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6832134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6832248Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6832366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6832487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6832561Z ) 2025-05-07T20:33:00.6832811Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6832914Z def test_silu_mul_quant( 2025-05-07T20:33:00.6832990Z self, 2025-05-07T20:33:00.6833069Z T: int, 2025-05-07T20:33:00.6833143Z D: int, 2025-05-07T20:33:00.6833238Z scale_ub: Optional[float], 2025-05-07T20:33:00.6833330Z contiguous: bool, 2025-05-07T20:33:00.6833417Z compiled: bool, 2025-05-07T20:33:00.6833495Z ) -> None: 2025-05-07T20:33:00.6833591Z torch.manual_seed(2025) 2025-05-07T20:33:00.6833662Z 2025-05-07T20:33:00.6833829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6833951Z 2025-05-07T20:33:00.6834042Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6838007Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6838116Z x = x_sign * x_clamp 2025-05-07T20:33:00.6838195Z x0 = x[:, :D] 2025-05-07T20:33:00.6838277Z x1 = x[:, D:] 2025-05-07T20:33:00.6838346Z 2025-05-07T20:33:00.6838426Z if contiguous: 2025-05-07T20:33:00.6838516Z x0 = x0.contiguous() 2025-05-07T20:33:00.6838601Z x1 = x1.contiguous() 2025-05-07T20:33:00.6838672Z 2025-05-07T20:33:00.6838761Z if scale_ub is not None: 2025-05-07T20:33:00.6838955Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6839095Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6839168Z ) 2025-05-07T20:33:00.6839239Z else: 2025-05-07T20:33:00.6839334Z scale_ub_tensor = None 2025-05-07T20:33:00.6839411Z 2025-05-07T20:33:00.6839545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6839635Z op = silu_mul_quant 2025-05-07T20:33:00.6839716Z if compiled: 2025-05-07T20:33:00.6839816Z op = torch.compile(op) 2025-05-07T20:33:00.6839924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6840037Z 2025-05-07T20:33:00.6840130Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6840135Z 2025-05-07T20:33:00.6840231Z moe/activation_test.py:117: 2025-05-07T20:33:00.6840359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6840460Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6840560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6840940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6841039Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6841548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6841648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6842012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6842281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6842631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6842722Z kernel = self.compile( 2025-05-07T20:33:00.6843111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6843288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6843413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6843418Z 2025-05-07T20:33:00.6843629Z self = 2025-05-07T20:33:00.6844427Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6844969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540af0>} 2025-05-07T20:33:00.6845762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6845956Z context = 2025-05-07T20:33:00.6845960Z 2025-05-07T20:33:00.6846128Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6846440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6846550Z module_map=module_map) 2025-05-07T20:33:00.6846714Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6846815Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6846894Z E ^ 2025-05-07T20:33:00.6847257Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6847262Z 2025-05-07T20:33:00.6847683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6847729Z 2025-05-07T20:33:00.6847836Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6848061Z self=, 2025-05-07T20:33:00.6848139Z T=4096, 2025-05-07T20:33:00.6848212Z D=7168, 2025-05-07T20:33:00.6848293Z scale_ub=None, 2025-05-07T20:33:00.6848383Z contiguous=False, 2025-05-07T20:33:00.6848466Z compiled=True, 2025-05-07T20:33:00.6848537Z ) 2025-05-07T20:33:00.6848761Z self = 2025-05-07T20:33:00.6848980Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6848985Z 2025-05-07T20:33:00.6849060Z @given( 2025-05-07T20:33:00.6849182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6849279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6849398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6849518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6849631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6849708Z ) 2025-05-07T20:33:00.6849956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6850048Z def test_silu_mul_quant( 2025-05-07T20:33:00.6850130Z self, 2025-05-07T20:33:00.6850204Z T: int, 2025-05-07T20:33:00.6850278Z D: int, 2025-05-07T20:33:00.6850377Z scale_ub: Optional[float], 2025-05-07T20:33:00.6850510Z contiguous: bool, 2025-05-07T20:33:00.6850593Z compiled: bool, 2025-05-07T20:33:00.6850674Z ) -> None: 2025-05-07T20:33:00.6850767Z torch.manual_seed(2025) 2025-05-07T20:33:00.6850840Z 2025-05-07T20:33:00.6851009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6851080Z 2025-05-07T20:33:00.6851175Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6851302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6851389Z x = x_sign * x_clamp 2025-05-07T20:33:00.6851474Z x0 = x[:, :D] 2025-05-07T20:33:00.6851551Z x1 = x[:, D:] 2025-05-07T20:33:00.6851619Z 2025-05-07T20:33:00.6851703Z if contiguous: 2025-05-07T20:33:00.6851792Z x0 = x0.contiguous() 2025-05-07T20:33:00.6851878Z x1 = x1.contiguous() 2025-05-07T20:33:00.6851948Z 2025-05-07T20:33:00.6852036Z if scale_ub is not None: 2025-05-07T20:33:00.6852144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6852279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6852351Z ) 2025-05-07T20:33:00.6852424Z else: 2025-05-07T20:33:00.6852514Z scale_ub_tensor = None 2025-05-07T20:33:00.6852585Z 2025-05-07T20:33:00.6852718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6852803Z op = silu_mul_quant 2025-05-07T20:33:00.6852888Z if compiled: 2025-05-07T20:33:00.6852989Z op = torch.compile(op) 2025-05-07T20:33:00.6853092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6853161Z 2025-05-07T20:33:00.6853251Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6853255Z 2025-05-07T20:33:00.6853350Z moe/activation_test.py:117: 2025-05-07T20:33:00.6853533Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6853633Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6853738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6854118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6854210Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6854714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6854855Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6855269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6855494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6856133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6856232Z kernel = self.compile( 2025-05-07T20:33:00.6856620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6856887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6857014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6857019Z 2025-05-07T20:33:00.6857224Z self = 2025-05-07T20:33:00.6858021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6858634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505540280>} 2025-05-07T20:33:00.6859396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6859658Z context = 2025-05-07T20:33:00.6859663Z 2025-05-07T20:33:00.6859831Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6860098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6860206Z module_map=module_map) 2025-05-07T20:33:00.6860371Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6860469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6860541Z E ^ 2025-05-07T20:33:00.6860903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6860908Z 2025-05-07T20:33:00.6861331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6861339Z 2025-05-07T20:33:00.6861443Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6861670Z self=, 2025-05-07T20:33:00.6861746Z T=16384, 2025-05-07T20:33:00.6861820Z D=5120, 2025-05-07T20:33:00.6861899Z scale_ub=1200.0, 2025-05-07T20:33:00.6861982Z contiguous=False, 2025-05-07T20:33:00.6862066Z compiled=False, 2025-05-07T20:33:00.6862134Z ) 2025-05-07T20:33:00.6862357Z self = 2025-05-07T20:33:00.6862537Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6862542Z 2025-05-07T20:33:00.6862614Z @given( 2025-05-07T20:33:00.6862797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6862895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6863008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6863135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6863250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6863320Z ) 2025-05-07T20:33:00.6863570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6863659Z def test_silu_mul_quant( 2025-05-07T20:33:00.6863735Z self, 2025-05-07T20:33:00.6863807Z T: int, 2025-05-07T20:33:00.6863939Z D: int, 2025-05-07T20:33:00.6864036Z scale_ub: Optional[float], 2025-05-07T20:33:00.6864122Z contiguous: bool, 2025-05-07T20:33:00.6864204Z compiled: bool, 2025-05-07T20:33:00.6864280Z ) -> None: 2025-05-07T20:33:00.6864373Z torch.manual_seed(2025) 2025-05-07T20:33:00.6864440Z 2025-05-07T20:33:00.6864616Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6864687Z 2025-05-07T20:33:00.6864775Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6864905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6864989Z x = x_sign * x_clamp 2025-05-07T20:33:00.6865110Z x0 = x[:, :D] 2025-05-07T20:33:00.6865188Z x1 = x[:, D:] 2025-05-07T20:33:00.6865256Z 2025-05-07T20:33:00.6865337Z if contiguous: 2025-05-07T20:33:00.6865427Z x0 = x0.contiguous() 2025-05-07T20:33:00.6865512Z x1 = x1.contiguous() 2025-05-07T20:33:00.6865587Z 2025-05-07T20:33:00.6865674Z if scale_ub is not None: 2025-05-07T20:33:00.6865777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6865913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6865983Z ) 2025-05-07T20:33:00.6866054Z else: 2025-05-07T20:33:00.6866151Z scale_ub_tensor = None 2025-05-07T20:33:00.6866222Z 2025-05-07T20:33:00.6866350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6866440Z op = silu_mul_quant 2025-05-07T20:33:00.6866564Z if compiled: 2025-05-07T20:33:00.6866662Z op = torch.compile(op) 2025-05-07T20:33:00.6866768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6866836Z 2025-05-07T20:33:00.6866928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6866933Z 2025-05-07T20:33:00.6867027Z moe/activation_test.py:117: 2025-05-07T20:33:00.6867151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6867254Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6867351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6867864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6867960Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6868325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6868554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6868902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6868993Z kernel = self.compile( 2025-05-07T20:33:00.6869383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6869559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6869691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6869696Z 2025-05-07T20:33:00.6869901Z self = 2025-05-07T20:33:00.6870765Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6871285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542d40>} 2025-05-07T20:33:00.6872043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6872276Z context = 2025-05-07T20:33:00.6872281Z 2025-05-07T20:33:00.6872446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6872711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6872822Z module_map=module_map) 2025-05-07T20:33:00.6872980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6873082Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6873158Z E ^ 2025-05-07T20:33:00.6873557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6873562Z 2025-05-07T20:33:00.6873986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6873991Z 2025-05-07T20:33:00.6874093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6874324Z self=, 2025-05-07T20:33:00.6874397Z T=16384, 2025-05-07T20:33:00.6874469Z D=5120, 2025-05-07T20:33:00.6874551Z scale_ub=1200.0, 2025-05-07T20:33:00.6874632Z contiguous=True, 2025-05-07T20:33:00.6874712Z compiled=True, 2025-05-07T20:33:00.6874785Z ) 2025-05-07T20:33:00.6875035Z self = 2025-05-07T20:33:00.6875229Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6875275Z 2025-05-07T20:33:00.6875353Z @given( 2025-05-07T20:33:00.6875471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6875571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6875683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6875801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6875919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6875990Z ) 2025-05-07T20:33:00.6876237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6876333Z def test_silu_mul_quant( 2025-05-07T20:33:00.6876404Z self, 2025-05-07T20:33:00.6876476Z T: int, 2025-05-07T20:33:00.6876550Z D: int, 2025-05-07T20:33:00.6876646Z scale_ub: Optional[float], 2025-05-07T20:33:00.6876734Z contiguous: bool, 2025-05-07T20:33:00.6876818Z compiled: bool, 2025-05-07T20:33:00.6876894Z ) -> None: 2025-05-07T20:33:00.6876989Z torch.manual_seed(2025) 2025-05-07T20:33:00.6877058Z 2025-05-07T20:33:00.6877228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6877300Z 2025-05-07T20:33:00.6877390Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6877512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6877600Z x = x_sign * x_clamp 2025-05-07T20:33:00.6877678Z x0 = x[:, :D] 2025-05-07T20:33:00.6877754Z x1 = x[:, D:] 2025-05-07T20:33:00.6877826Z 2025-05-07T20:33:00.6877907Z if contiguous: 2025-05-07T20:33:00.6877996Z x0 = x0.contiguous() 2025-05-07T20:33:00.6878083Z x1 = x1.contiguous() 2025-05-07T20:33:00.6878150Z 2025-05-07T20:33:00.6878285Z if scale_ub is not None: 2025-05-07T20:33:00.6878387Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6878519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6878598Z ) 2025-05-07T20:33:00.6878670Z else: 2025-05-07T20:33:00.6878762Z scale_ub_tensor = None 2025-05-07T20:33:00.6878833Z 2025-05-07T20:33:00.6878961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6879047Z op = silu_mul_quant 2025-05-07T20:33:00.6879130Z if compiled: 2025-05-07T20:33:00.6879226Z op = torch.compile(op) 2025-05-07T20:33:00.6879371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6879442Z 2025-05-07T20:33:00.6879528Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6879532Z 2025-05-07T20:33:00.6879631Z moe/activation_test.py:117: 2025-05-07T20:33:00.6879754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6879856Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6879956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6880330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6880462Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6880970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6881067Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6881431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6881656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6882001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6882094Z kernel = self.compile( 2025-05-07T20:33:00.6882483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6882657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6882829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6882834Z 2025-05-07T20:33:00.6883038Z self = 2025-05-07T20:33:00.6883832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6884344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505542830>} 2025-05-07T20:33:00.6885149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6885350Z context = 2025-05-07T20:33:00.6885355Z 2025-05-07T20:33:00.6885520Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6885788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6885894Z module_map=module_map) 2025-05-07T20:33:00.6886056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6886157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6886232Z E ^ 2025-05-07T20:33:00.6886592Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6886597Z 2025-05-07T20:33:00.6887057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6887062Z 2025-05-07T20:33:00.6887166Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6887396Z self=, 2025-05-07T20:33:00.6887473Z T=16384, 2025-05-07T20:33:00.6887550Z D=5120, 2025-05-07T20:33:00.6887629Z scale_ub=None, 2025-05-07T20:33:00.6887714Z contiguous=False, 2025-05-07T20:33:00.6887798Z compiled=True, 2025-05-07T20:33:00.6887866Z ) 2025-05-07T20:33:00.6888082Z self = 2025-05-07T20:33:00.6888301Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6888306Z 2025-05-07T20:33:00.6888379Z @given( 2025-05-07T20:33:00.6888494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6888595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6888709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6888827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6888938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6889011Z ) 2025-05-07T20:33:00.6889300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6889391Z def test_silu_mul_quant( 2025-05-07T20:33:00.6889464Z self, 2025-05-07T20:33:00.6889545Z T: int, 2025-05-07T20:33:00.6889618Z D: int, 2025-05-07T20:33:00.6889714Z scale_ub: Optional[float], 2025-05-07T20:33:00.6889807Z contiguous: bool, 2025-05-07T20:33:00.6889888Z compiled: bool, 2025-05-07T20:33:00.6889961Z ) -> None: 2025-05-07T20:33:00.6890054Z torch.manual_seed(2025) 2025-05-07T20:33:00.6890124Z 2025-05-07T20:33:00.6890296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6890366Z 2025-05-07T20:33:00.6890458Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6890582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6890667Z x = x_sign * x_clamp 2025-05-07T20:33:00.6890787Z x0 = x[:, :D] 2025-05-07T20:33:00.6890864Z x1 = x[:, D:] 2025-05-07T20:33:00.6890933Z 2025-05-07T20:33:00.6891015Z if contiguous: 2025-05-07T20:33:00.6891107Z x0 = x0.contiguous() 2025-05-07T20:33:00.6891191Z x1 = x1.contiguous() 2025-05-07T20:33:00.6891259Z 2025-05-07T20:33:00.6891348Z if scale_ub is not None: 2025-05-07T20:33:00.6891452Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6891591Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6891662Z ) 2025-05-07T20:33:00.6891733Z else: 2025-05-07T20:33:00.6891826Z scale_ub_tensor = None 2025-05-07T20:33:00.6891897Z 2025-05-07T20:33:00.6892026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6892116Z op = silu_mul_quant 2025-05-07T20:33:00.6892198Z if compiled: 2025-05-07T20:33:00.6892293Z op = torch.compile(op) 2025-05-07T20:33:00.6892400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6892468Z 2025-05-07T20:33:00.6892556Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6892561Z 2025-05-07T20:33:00.6892658Z moe/activation_test.py:117: 2025-05-07T20:33:00.6892780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6892880Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6892977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6893354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6893446Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6893994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6894089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6894452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6894679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6895026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6895117Z kernel = self.compile( 2025-05-07T20:33:00.6895504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6895722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6895843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6895847Z 2025-05-07T20:33:00.6896053Z self = 2025-05-07T20:33:00.6896846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6897400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd505543760>} 2025-05-07T20:33:00.6898240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6898437Z context = 2025-05-07T20:33:00.6898445Z 2025-05-07T20:33:00.6898608Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6898880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6898987Z module_map=module_map) 2025-05-07T20:33:00.6899146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6899311Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6899386Z E ^ 2025-05-07T20:33:00.6899748Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6899752Z 2025-05-07T20:33:00.6900175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6900182Z 2025-05-07T20:33:00.6900284Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6900507Z self=, 2025-05-07T20:33:00.6900584Z T=2048, 2025-05-07T20:33:00.6900657Z D=5120, 2025-05-07T20:33:00.6900735Z scale_ub=None, 2025-05-07T20:33:00.6900820Z contiguous=False, 2025-05-07T20:33:00.6900901Z compiled=True, 2025-05-07T20:33:00.6900972Z ) 2025-05-07T20:33:00.6901191Z self = 2025-05-07T20:33:00.6901366Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.6901373Z 2025-05-07T20:33:00.6901450Z @given( 2025-05-07T20:33:00.6901568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6901665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6901779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6901894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6902008Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6902084Z ) 2025-05-07T20:33:00.6902330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6902424Z def test_silu_mul_quant( 2025-05-07T20:33:00.6902496Z self, 2025-05-07T20:33:00.6902612Z T: int, 2025-05-07T20:33:00.6902688Z D: int, 2025-05-07T20:33:00.6902782Z scale_ub: Optional[float], 2025-05-07T20:33:00.6902866Z contiguous: bool, 2025-05-07T20:33:00.6902953Z compiled: bool, 2025-05-07T20:33:00.6903029Z ) -> None: 2025-05-07T20:33:00.6903123Z torch.manual_seed(2025) 2025-05-07T20:33:00.6903194Z 2025-05-07T20:33:00.6903360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6903430Z 2025-05-07T20:33:00.6903521Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6903643Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6903771Z x = x_sign * x_clamp 2025-05-07T20:33:00.6903851Z x0 = x[:, :D] 2025-05-07T20:33:00.6903926Z x1 = x[:, D:] 2025-05-07T20:33:00.6903999Z 2025-05-07T20:33:00.6904079Z if contiguous: 2025-05-07T20:33:00.6904166Z x0 = x0.contiguous() 2025-05-07T20:33:00.6904252Z x1 = x1.contiguous() 2025-05-07T20:33:00.6904324Z 2025-05-07T20:33:00.6904412Z if scale_ub is not None: 2025-05-07T20:33:00.6904518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6904651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6904723Z ) 2025-05-07T20:33:00.6904858Z else: 2025-05-07T20:33:00.6904963Z scale_ub_tensor = None 2025-05-07T20:33:00.6905044Z 2025-05-07T20:33:00.6905189Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6905274Z op = silu_mul_quant 2025-05-07T20:33:00.6905356Z if compiled: 2025-05-07T20:33:00.6905456Z op = torch.compile(op) 2025-05-07T20:33:00.6905559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6905629Z 2025-05-07T20:33:00.6905715Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6905720Z 2025-05-07T20:33:00.6905814Z moe/activation_test.py:117: 2025-05-07T20:33:00.6905943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6906040Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6906138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6906560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6906652Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6907156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6907252Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6907615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6907839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6908182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6908274Z kernel = self.compile( 2025-05-07T20:33:00.6908661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6908838Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6908963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6908968Z 2025-05-07T20:33:00.6909173Z self = 2025-05-07T20:33:00.6909961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6910474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cc3a0>} 2025-05-07T20:33:00.6911272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6911472Z context = 2025-05-07T20:33:00.6911477Z 2025-05-07T20:33:00.6911642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6911910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6912019Z module_map=module_map) 2025-05-07T20:33:00.6912231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6912331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6912405Z E ^ 2025-05-07T20:33:00.6912762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6912767Z 2025-05-07T20:33:00.6913195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6913202Z 2025-05-07T20:33:00.6913303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6913566Z self=, 2025-05-07T20:33:00.6913642Z T=2048, 2025-05-07T20:33:00.6913714Z D=5120, 2025-05-07T20:33:00.6913795Z scale_ub=1200.0, 2025-05-07T20:33:00.6913878Z contiguous=False, 2025-05-07T20:33:00.6913961Z compiled=True, 2025-05-07T20:33:00.6914032Z ) 2025-05-07T20:33:00.6914255Z self = 2025-05-07T20:33:00.6914428Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6914436Z 2025-05-07T20:33:00.6914507Z @given( 2025-05-07T20:33:00.6914622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6914723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6914834Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6914948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6915129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6915217Z ) 2025-05-07T20:33:00.6915477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6915569Z def test_silu_mul_quant( 2025-05-07T20:33:00.6915642Z self, 2025-05-07T20:33:00.6915714Z T: int, 2025-05-07T20:33:00.6915788Z D: int, 2025-05-07T20:33:00.6915887Z scale_ub: Optional[float], 2025-05-07T20:33:00.6915976Z contiguous: bool, 2025-05-07T20:33:00.6916056Z compiled: bool, 2025-05-07T20:33:00.6916130Z ) -> None: 2025-05-07T20:33:00.6916228Z torch.manual_seed(2025) 2025-05-07T20:33:00.6916296Z 2025-05-07T20:33:00.6916464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6916539Z 2025-05-07T20:33:00.6916628Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6916749Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6916843Z x = x_sign * x_clamp 2025-05-07T20:33:00.6916918Z x0 = x[:, :D] 2025-05-07T20:33:00.6916998Z x1 = x[:, D:] 2025-05-07T20:33:00.6917070Z 2025-05-07T20:33:00.6917150Z if contiguous: 2025-05-07T20:33:00.6917240Z x0 = x0.contiguous() 2025-05-07T20:33:00.6917325Z x1 = x1.contiguous() 2025-05-07T20:33:00.6917392Z 2025-05-07T20:33:00.6917481Z if scale_ub is not None: 2025-05-07T20:33:00.6917585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6917718Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6917792Z ) 2025-05-07T20:33:00.6917863Z else: 2025-05-07T20:33:00.6917952Z scale_ub_tensor = None 2025-05-07T20:33:00.6918024Z 2025-05-07T20:33:00.6918196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6918284Z op = silu_mul_quant 2025-05-07T20:33:00.6918367Z if compiled: 2025-05-07T20:33:00.6918465Z op = torch.compile(op) 2025-05-07T20:33:00.6918572Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6918640Z 2025-05-07T20:33:00.6918727Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6918732Z 2025-05-07T20:33:00.6918829Z moe/activation_test.py:117: 2025-05-07T20:33:00.6918952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6919092Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6919190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6919561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6919652Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6920156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6920250Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6920616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6920875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6921220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6921313Z kernel = self.compile( 2025-05-07T20:33:00.6921698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6921880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6922000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6922005Z 2025-05-07T20:33:00.6922213Z self = 2025-05-07T20:33:00.6923006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6923555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cc820>} 2025-05-07T20:33:00.6924315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6924509Z context = 2025-05-07T20:33:00.6924513Z 2025-05-07T20:33:00.6924680Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6924948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6925056Z module_map=module_map) 2025-05-07T20:33:00.6925219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6925318Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6925392Z E ^ 2025-05-07T20:33:00.6925752Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6925757Z 2025-05-07T20:33:00.6926178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6926185Z 2025-05-07T20:33:00.6926289Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6926512Z self=, 2025-05-07T20:33:00.6926586Z T=4096, 2025-05-07T20:33:00.6926661Z D=5120, 2025-05-07T20:33:00.6926783Z scale_ub=1200.0, 2025-05-07T20:33:00.6926864Z contiguous=True, 2025-05-07T20:33:00.6926947Z compiled=True, 2025-05-07T20:33:00.6927016Z ) 2025-05-07T20:33:00.6927235Z self = 2025-05-07T20:33:00.6927411Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6927416Z 2025-05-07T20:33:00.6927488Z @given( 2025-05-07T20:33:00.6927605Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6927702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6927814Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6927972Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6928082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6928152Z ) 2025-05-07T20:33:00.6928400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6928491Z def test_silu_mul_quant( 2025-05-07T20:33:00.6928566Z self, 2025-05-07T20:33:00.6928643Z T: int, 2025-05-07T20:33:00.6928714Z D: int, 2025-05-07T20:33:00.6928810Z scale_ub: Optional[float], 2025-05-07T20:33:00.6928901Z contiguous: bool, 2025-05-07T20:33:00.6928984Z compiled: bool, 2025-05-07T20:33:00.6929124Z ) -> None: 2025-05-07T20:33:00.6929219Z torch.manual_seed(2025) 2025-05-07T20:33:00.6929289Z 2025-05-07T20:33:00.6929460Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6929530Z 2025-05-07T20:33:00.6929619Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6929746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6929830Z x = x_sign * x_clamp 2025-05-07T20:33:00.6929906Z x0 = x[:, :D] 2025-05-07T20:33:00.6929990Z x1 = x[:, D:] 2025-05-07T20:33:00.6930059Z 2025-05-07T20:33:00.6930138Z if contiguous: 2025-05-07T20:33:00.6930229Z x0 = x0.contiguous() 2025-05-07T20:33:00.6930315Z x1 = x1.contiguous() 2025-05-07T20:33:00.6930386Z 2025-05-07T20:33:00.6930473Z if scale_ub is not None: 2025-05-07T20:33:00.6930618Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6930755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6930826Z ) 2025-05-07T20:33:00.6930897Z else: 2025-05-07T20:33:00.6930990Z scale_ub_tensor = None 2025-05-07T20:33:00.6931060Z 2025-05-07T20:33:00.6931187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6931282Z op = silu_mul_quant 2025-05-07T20:33:00.6931362Z if compiled: 2025-05-07T20:33:00.6931457Z op = torch.compile(op) 2025-05-07T20:33:00.6931562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6931630Z 2025-05-07T20:33:00.6931721Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6931725Z 2025-05-07T20:33:00.6931827Z moe/activation_test.py:117: 2025-05-07T20:33:00.6931951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6932054Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6932154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6932531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6932623Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6933123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6933223Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6933585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6933806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6934197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6934290Z kernel = self.compile( 2025-05-07T20:33:00.6934676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6934856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6934978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6934983Z 2025-05-07T20:33:00.6935221Z self = 2025-05-07T20:33:00.6936027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6936579Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cd360>} 2025-05-07T20:33:00.6937341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6937572Z context = 2025-05-07T20:33:00.6937577Z 2025-05-07T20:33:00.6937745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6938013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6938207Z module_map=module_map) 2025-05-07T20:33:00.6938366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6938462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6938539Z E ^ 2025-05-07T20:33:00.6938899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6938904Z 2025-05-07T20:33:00.6939322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6939370Z 2025-05-07T20:33:00.6939479Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6939702Z self=, 2025-05-07T20:33:00.6939780Z T=128, 2025-05-07T20:33:00.6939854Z D=5120, 2025-05-07T20:33:00.6939936Z scale_ub=1200.0, 2025-05-07T20:33:00.6940022Z contiguous=False, 2025-05-07T20:33:00.6940106Z compiled=True, 2025-05-07T20:33:00.6940174Z ) 2025-05-07T20:33:00.6940394Z self = 2025-05-07T20:33:00.6940563Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6940567Z 2025-05-07T20:33:00.6940641Z @given( 2025-05-07T20:33:00.6940760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6940855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6940970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6941087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6941199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6941273Z ) 2025-05-07T20:33:00.6941518Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6941607Z def test_silu_mul_quant( 2025-05-07T20:33:00.6941684Z self, 2025-05-07T20:33:00.6941756Z T: int, 2025-05-07T20:33:00.6941831Z D: int, 2025-05-07T20:33:00.6941930Z scale_ub: Optional[float], 2025-05-07T20:33:00.6942016Z contiguous: bool, 2025-05-07T20:33:00.6942098Z compiled: bool, 2025-05-07T20:33:00.6942175Z ) -> None: 2025-05-07T20:33:00.6942264Z torch.manual_seed(2025) 2025-05-07T20:33:00.6942336Z 2025-05-07T20:33:00.6942549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6942620Z 2025-05-07T20:33:00.6942713Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6942842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6942925Z x = x_sign * x_clamp 2025-05-07T20:33:00.6943005Z x0 = x[:, :D] 2025-05-07T20:33:00.6943083Z x1 = x[:, D:] 2025-05-07T20:33:00.6943151Z 2025-05-07T20:33:00.6943235Z if contiguous: 2025-05-07T20:33:00.6943325Z x0 = x0.contiguous() 2025-05-07T20:33:00.6943410Z x1 = x1.contiguous() 2025-05-07T20:33:00.6943521Z 2025-05-07T20:33:00.6943608Z if scale_ub is not None: 2025-05-07T20:33:00.6943715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6943849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6943921Z ) 2025-05-07T20:33:00.6943997Z else: 2025-05-07T20:33:00.6944088Z scale_ub_tensor = None 2025-05-07T20:33:00.6944161Z 2025-05-07T20:33:00.6944293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6944383Z op = silu_mul_quant 2025-05-07T20:33:00.6944466Z if compiled: 2025-05-07T20:33:00.6944565Z op = torch.compile(op) 2025-05-07T20:33:00.6944707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6944778Z 2025-05-07T20:33:00.6944868Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6944873Z 2025-05-07T20:33:00.6944968Z moe/activation_test.py:117: 2025-05-07T20:33:00.6945096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6945200Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6945312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6945724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6945814Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6946318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6946461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6946827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6947050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6947396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6947490Z kernel = self.compile( 2025-05-07T20:33:00.6947878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6948053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6948172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6948187Z 2025-05-07T20:33:00.6948390Z self = 2025-05-07T20:33:00.6949180Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6949693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054ce290>} 2025-05-07T20:33:00.6950452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6950645Z context = 2025-05-07T20:33:00.6950650Z 2025-05-07T20:33:00.6950853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6951121Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6951229Z module_map=module_map) 2025-05-07T20:33:00.6951391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6951489Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6951561Z E ^ 2025-05-07T20:33:00.6951918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6951922Z 2025-05-07T20:33:00.6952386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6952391Z 2025-05-07T20:33:00.6952493Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6952717Z self=, 2025-05-07T20:33:00.6952796Z T=16384, 2025-05-07T20:33:00.6952870Z D=7168, 2025-05-07T20:33:00.6952952Z scale_ub=1200.0, 2025-05-07T20:33:00.6953032Z contiguous=True, 2025-05-07T20:33:00.6953110Z compiled=True, 2025-05-07T20:33:00.6953184Z ) 2025-05-07T20:33:00.6953444Z self = 2025-05-07T20:33:00.6953619Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.6953624Z 2025-05-07T20:33:00.6953700Z @given( 2025-05-07T20:33:00.6953815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6953910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6954028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6954143Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6954255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6954328Z ) 2025-05-07T20:33:00.6954577Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6954671Z def test_silu_mul_quant( 2025-05-07T20:33:00.6954744Z self, 2025-05-07T20:33:00.6954817Z T: int, 2025-05-07T20:33:00.6954934Z D: int, 2025-05-07T20:33:00.6955029Z scale_ub: Optional[float], 2025-05-07T20:33:00.6955119Z contiguous: bool, 2025-05-07T20:33:00.6955205Z compiled: bool, 2025-05-07T20:33:00.6955279Z ) -> None: 2025-05-07T20:33:00.6955373Z torch.manual_seed(2025) 2025-05-07T20:33:00.6955448Z 2025-05-07T20:33:00.6955869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6955988Z 2025-05-07T20:33:00.6956115Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6956282Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6956411Z x = x_sign * x_clamp 2025-05-07T20:33:00.6956501Z x0 = x[:, :D] 2025-05-07T20:33:00.6956579Z x1 = x[:, D:] 2025-05-07T20:33:00.6956646Z 2025-05-07T20:33:00.6956728Z if contiguous: 2025-05-07T20:33:00.6956816Z x0 = x0.contiguous() 2025-05-07T20:33:00.6956906Z x1 = x1.contiguous() 2025-05-07T20:33:00.6956981Z 2025-05-07T20:33:00.6960958Z if scale_ub is not None: 2025-05-07T20:33:00.6961086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6961222Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6961299Z ) 2025-05-07T20:33:00.6961371Z else: 2025-05-07T20:33:00.6961462Z scale_ub_tensor = None 2025-05-07T20:33:00.6961533Z 2025-05-07T20:33:00.6961663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6961757Z op = silu_mul_quant 2025-05-07T20:33:00.6961839Z if compiled: 2025-05-07T20:33:00.6961939Z op = torch.compile(op) 2025-05-07T20:33:00.6962046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6962116Z 2025-05-07T20:33:00.6962204Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6962319Z 2025-05-07T20:33:00.6962423Z moe/activation_test.py:117: 2025-05-07T20:33:00.6962555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6962657Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6962762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6963141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.6963235Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.6963738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6963931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6964296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6964518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6964878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6964989Z kernel = self.compile( 2025-05-07T20:33:00.6965464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6965644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6965770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6965774Z 2025-05-07T20:33:00.6965982Z self = 2025-05-07T20:33:00.6966784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6967296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054ced40>} 2025-05-07T20:33:00.6968061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6968314Z context = 2025-05-07T20:33:00.6968319Z 2025-05-07T20:33:00.6968486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6968753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6968860Z module_map=module_map) 2025-05-07T20:33:00.6969024Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6969121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6969196Z E ^ 2025-05-07T20:33:00.6969562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6969566Z 2025-05-07T20:33:00.6969990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6969997Z 2025-05-07T20:33:00.6970100Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6970324Z self=, 2025-05-07T20:33:00.6970397Z T=16384, 2025-05-07T20:33:00.6970475Z D=5120, 2025-05-07T20:33:00.6970554Z scale_ub=1200.0, 2025-05-07T20:33:00.6970639Z contiguous=True, 2025-05-07T20:33:00.6970723Z compiled=False, 2025-05-07T20:33:00.6970795Z ) 2025-05-07T20:33:00.6971014Z self = 2025-05-07T20:33:00.6971194Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.6971198Z 2025-05-07T20:33:00.6971314Z @given( 2025-05-07T20:33:00.6971435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6971531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6971647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6971767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6971878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6971949Z ) 2025-05-07T20:33:00.6972204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6972296Z def test_silu_mul_quant( 2025-05-07T20:33:00.6972415Z self, 2025-05-07T20:33:00.6972487Z T: int, 2025-05-07T20:33:00.6972559Z D: int, 2025-05-07T20:33:00.6972658Z scale_ub: Optional[float], 2025-05-07T20:33:00.6972744Z contiguous: bool, 2025-05-07T20:33:00.6972825Z compiled: bool, 2025-05-07T20:33:00.6972905Z ) -> None: 2025-05-07T20:33:00.6973001Z torch.manual_seed(2025) 2025-05-07T20:33:00.6973070Z 2025-05-07T20:33:00.6973243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6973317Z 2025-05-07T20:33:00.6973410Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6973580Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6973665Z x = x_sign * x_clamp 2025-05-07T20:33:00.6973742Z x0 = x[:, :D] 2025-05-07T20:33:00.6973821Z x1 = x[:, D:] 2025-05-07T20:33:00.6973889Z 2025-05-07T20:33:00.6973972Z if contiguous: 2025-05-07T20:33:00.6974060Z x0 = x0.contiguous() 2025-05-07T20:33:00.6974150Z x1 = x1.contiguous() 2025-05-07T20:33:00.6974223Z 2025-05-07T20:33:00.6974312Z if scale_ub is not None: 2025-05-07T20:33:00.6974414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6974553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6974625Z ) 2025-05-07T20:33:00.6974700Z else: 2025-05-07T20:33:00.6974793Z scale_ub_tensor = None 2025-05-07T20:33:00.6974862Z 2025-05-07T20:33:00.6974999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6975151Z op = silu_mul_quant 2025-05-07T20:33:00.6975252Z if compiled: 2025-05-07T20:33:00.6975357Z op = torch.compile(op) 2025-05-07T20:33:00.6975459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6975527Z 2025-05-07T20:33:00.6975615Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6975619Z 2025-05-07T20:33:00.6975713Z moe/activation_test.py:117: 2025-05-07T20:33:00.6975841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6975945Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6976042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6976550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.6976649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6977010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6977239Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6977585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6977678Z kernel = self.compile( 2025-05-07T20:33:00.6978155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6978337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6978459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6978463Z 2025-05-07T20:33:00.6978667Z self = 2025-05-07T20:33:00.6979510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6980027Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5054cfac0>} 2025-05-07T20:33:00.6980791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6981020Z context = 2025-05-07T20:33:00.6981025Z 2025-05-07T20:33:00.6981193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6981462Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6981567Z module_map=module_map) 2025-05-07T20:33:00.6981729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6981827Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6981940Z E ^ 2025-05-07T20:33:00.6982302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6982307Z 2025-05-07T20:33:00.6982726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6982734Z 2025-05-07T20:33:00.6982840Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6983063Z self=, 2025-05-07T20:33:00.6983135Z T=1, 2025-05-07T20:33:00.6983210Z D=7168, 2025-05-07T20:33:00.6983290Z scale_ub=1200.0, 2025-05-07T20:33:00.6983378Z contiguous=False, 2025-05-07T20:33:00.6983462Z compiled=False, 2025-05-07T20:33:00.6983532Z ) 2025-05-07T20:33:00.6983752Z self = 2025-05-07T20:33:00.6983964Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.6983974Z 2025-05-07T20:33:00.6984049Z @given( 2025-05-07T20:33:00.6984168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6984264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6984378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6984499Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6984608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6984681Z ) 2025-05-07T20:33:00.6984927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6985017Z def test_silu_mul_quant( 2025-05-07T20:33:00.6985095Z self, 2025-05-07T20:33:00.6985169Z T: int, 2025-05-07T20:33:00.6985241Z D: int, 2025-05-07T20:33:00.6985338Z scale_ub: Optional[float], 2025-05-07T20:33:00.6985423Z contiguous: bool, 2025-05-07T20:33:00.6985507Z compiled: bool, 2025-05-07T20:33:00.6985584Z ) -> None: 2025-05-07T20:33:00.6985678Z torch.manual_seed(2025) 2025-05-07T20:33:00.6985746Z 2025-05-07T20:33:00.6985915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6985985Z 2025-05-07T20:33:00.6986073Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6986202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6986288Z x = x_sign * x_clamp 2025-05-07T20:33:00.6986367Z x0 = x[:, :D] 2025-05-07T20:33:00.6986442Z x1 = x[:, D:] 2025-05-07T20:33:00.6986510Z 2025-05-07T20:33:00.6986592Z if contiguous: 2025-05-07T20:33:00.6986679Z x0 = x0.contiguous() 2025-05-07T20:33:00.6986809Z x1 = x1.contiguous() 2025-05-07T20:33:00.6986880Z 2025-05-07T20:33:00.6986968Z if scale_ub is not None: 2025-05-07T20:33:00.6987069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6987208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6987282Z ) 2025-05-07T20:33:00.6987353Z else: 2025-05-07T20:33:00.6987446Z scale_ub_tensor = None 2025-05-07T20:33:00.6987515Z 2025-05-07T20:33:00.6987645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.6987732Z op = silu_mul_quant 2025-05-07T20:33:00.6987854Z if compiled: 2025-05-07T20:33:00.6987951Z op = torch.compile(op) 2025-05-07T20:33:00.6988053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6988125Z 2025-05-07T20:33:00.6988214Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.6988218Z 2025-05-07T20:33:00.6988313Z moe/activation_test.py:117: 2025-05-07T20:33:00.6988440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6988541Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.6988637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.6989192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.6989288Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.6989651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.6989875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.6990226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.6990317Z kernel = self.compile( 2025-05-07T20:33:00.6990707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.6990883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.6991008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.6991052Z 2025-05-07T20:33:00.6991260Z self = 2025-05-07T20:33:00.6992053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.6992566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5050484c0>} 2025-05-07T20:33:00.6993329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.6993522Z context = 2025-05-07T20:33:00.6993527Z 2025-05-07T20:33:00.6993694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.6993964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.6994070Z module_map=module_map) 2025-05-07T20:33:00.6994230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.6994334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.6994413Z E ^ 2025-05-07T20:33:00.6994774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.6994779Z 2025-05-07T20:33:00.6995254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.6995259Z 2025-05-07T20:33:00.6995427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.6995656Z self=, 2025-05-07T20:33:00.6995733Z T=4096, 2025-05-07T20:33:00.6995807Z D=7168, 2025-05-07T20:33:00.6995890Z scale_ub=1200.0, 2025-05-07T20:33:00.6995977Z contiguous=False, 2025-05-07T20:33:00.6996059Z compiled=True, 2025-05-07T20:33:00.6996133Z ) 2025-05-07T20:33:00.6996351Z self = 2025-05-07T20:33:00.6996528Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.6996577Z 2025-05-07T20:33:00.6996653Z @given( 2025-05-07T20:33:00.6996770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.6996871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.6996983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.6997100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.6997217Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.6997286Z ) 2025-05-07T20:33:00.6997531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.6997628Z def test_silu_mul_quant( 2025-05-07T20:33:00.6997740Z self, 2025-05-07T20:33:00.6997814Z T: int, 2025-05-07T20:33:00.6997890Z D: int, 2025-05-07T20:33:00.6997986Z scale_ub: Optional[float], 2025-05-07T20:33:00.6998074Z contiguous: bool, 2025-05-07T20:33:00.6998155Z compiled: bool, 2025-05-07T20:33:00.6998229Z ) -> None: 2025-05-07T20:33:00.6998325Z torch.manual_seed(2025) 2025-05-07T20:33:00.6998394Z 2025-05-07T20:33:00.6998560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.6998637Z 2025-05-07T20:33:00.6998726Z x_sign = torch.sign(x) 2025-05-07T20:33:00.6998848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.6998937Z x = x_sign * x_clamp 2025-05-07T20:33:00.6999013Z x0 = x[:, :D] 2025-05-07T20:33:00.6999090Z x1 = x[:, D:] 2025-05-07T20:33:00.6999204Z 2025-05-07T20:33:00.6999284Z if contiguous: 2025-05-07T20:33:00.6999373Z x0 = x0.contiguous() 2025-05-07T20:33:00.6999464Z x1 = x1.contiguous() 2025-05-07T20:33:00.6999532Z 2025-05-07T20:33:00.6999623Z if scale_ub is not None: 2025-05-07T20:33:00.6999725Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.6999857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.6999934Z ) 2025-05-07T20:33:00.7000006Z else: 2025-05-07T20:33:00.7000095Z scale_ub_tensor = None 2025-05-07T20:33:00.7000170Z 2025-05-07T20:33:00.7000304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7000391Z op = silu_mul_quant 2025-05-07T20:33:00.7000473Z if compiled: 2025-05-07T20:33:00.7000570Z op = torch.compile(op) 2025-05-07T20:33:00.7000672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7000741Z 2025-05-07T20:33:00.7000832Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7000837Z 2025-05-07T20:33:00.7000933Z moe/activation_test.py:117: 2025-05-07T20:33:00.7001059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7001157Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7001255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7001629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.7001720Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.7002223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7002317Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7002726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7002950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7003300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7003396Z kernel = self.compile( 2025-05-07T20:33:00.7003782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7003961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7004122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7004126Z 2025-05-07T20:33:00.7004331Z self = 2025-05-07T20:33:00.7005160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7005735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fd5050491b0>} 2025-05-07T20:33:00.7006501Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7006692Z context = 2025-05-07T20:33:00.7006700Z 2025-05-07T20:33:00.7006864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7007133Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7007238Z module_map=module_map) 2025-05-07T20:33:00.7007404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7007501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7007618Z E ^ 2025-05-07T20:33:00.7007983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7008408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.7008522Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True ) -- fails with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); test source and traceback identical to the example above.
2025-05-07T20:33:00.7021593Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True ) -- same CompilationError.
2025-05-07T20:33:00.7034637Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False ) -- this and the remaining failures below occur during input setup, before the kernel is reached; here the test dies at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0): 2025-05-07T20:33:00.7040024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7040187Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:00.7040298Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 112.00 MiB with 28.44 MiB free.
2025-05-07T20:33:00.7045768Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free.
2025-05-07T20:33:00.7050954Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 56.00 MiB with 28.44 MiB free.
2025-05-07T20:33:00.7056952Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 28.44 MiB free.
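Every OutOfMemoryError above ends with the allocator's own hint: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets the CUDA caching allocator grow existing segments instead of carving out fixed-size blocks, which can reduce the "reserved but unallocated" fragmentation these messages report. The variable must be set before CUDA is initialized in the process; a sketch of one way to apply it, illustrative only and not this workflow's actual configuration:

    # Sketch: opt in to expandable segments before anything touches CUDA.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the variable is set

    # The first CUDA allocation initializes the allocator with the new setting.
    x = torch.empty(1, device="cuda")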
2025-05-07T20:33:00.7062480Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False ) -- fails with the same CompilationError as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.7075133Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7091712Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7104372Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7109687Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False ) -- same CompilationError.
2025-05-07T20:33:00.7122263Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7127653Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7132876Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7138108Z Trying example: test_silu_mul_quant( self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 26.44 MiB free.
2025-05-07T20:33:00.7143368Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True ) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 26.44 MiB free.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7148394Z 2025-05-07T20:33:00.7148510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7148515Z 2025-05-07T20:33:00.7148620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7148846Z self=, 2025-05-07T20:33:00.7148922Z T=2048, 2025-05-07T20:33:00.7149001Z D=5120, 2025-05-07T20:33:00.7149084Z scale_ub=1200.0, 2025-05-07T20:33:00.7149171Z contiguous=False, 2025-05-07T20:33:00.7149259Z compiled=False, 2025-05-07T20:33:00.7149331Z ) 2025-05-07T20:33:00.7149551Z self = 2025-05-07T20:33:00.7149728Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.7149796Z 2025-05-07T20:33:00.7149874Z @given( 2025-05-07T20:33:00.7149993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7150095Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7150207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7150330Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7150443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7150521Z ) 2025-05-07T20:33:00.7150769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7150863Z def test_silu_mul_quant( 2025-05-07T20:33:00.7150937Z self, 2025-05-07T20:33:00.7151014Z T: int, 2025-05-07T20:33:00.7151088Z D: int, 2025-05-07T20:33:00.7151186Z scale_ub: Optional[float], 2025-05-07T20:33:00.7151277Z contiguous: bool, 2025-05-07T20:33:00.7151361Z compiled: bool, 2025-05-07T20:33:00.7151445Z ) -> None: 2025-05-07T20:33:00.7151538Z torch.manual_seed(2025) 2025-05-07T20:33:00.7151611Z 2025-05-07T20:33:00.7151784Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7153612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7153620Z 2025-05-07T20:33:00.7153781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7153786Z 2025-05-07T20:33:00.7153891Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7154118Z self=, 2025-05-07T20:33:00.7154199Z T=4096, 2025-05-07T20:33:00.7154275Z D=7168, 2025-05-07T20:33:00.7154358Z scale_ub=1200.0, 2025-05-07T20:33:00.7154443Z contiguous=True, 2025-05-07T20:33:00.7154526Z compiled=False, 2025-05-07T20:33:00.7154597Z ) 2025-05-07T20:33:00.7154817Z self = 2025-05-07T20:33:00.7155032Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.7155036Z 2025-05-07T20:33:00.7155135Z @given( 2025-05-07T20:33:00.7155263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7155377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7155496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7155931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7156093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7156175Z ) 2025-05-07T20:33:00.7156519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7156619Z def test_silu_mul_quant( 2025-05-07T20:33:00.7156694Z self, 2025-05-07T20:33:00.7156769Z T: int, 2025-05-07T20:33:00.7156844Z D: int, 2025-05-07T20:33:00.7156938Z scale_ub: Optional[float], 2025-05-07T20:33:00.7157029Z contiguous: bool, 2025-05-07T20:33:00.7157113Z compiled: bool, 2025-05-07T20:33:00.7157189Z ) -> None: 2025-05-07T20:33:00.7157281Z torch.manual_seed(2025) 2025-05-07T20:33:00.7157356Z 2025-05-07T20:33:00.7157522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7159380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7159454Z 2025-05-07T20:33:00.7159572Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7159579Z 2025-05-07T20:33:00.7159684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7159907Z self=, 2025-05-07T20:33:00.7159982Z T=16384, 2025-05-07T20:33:00.7160058Z D=7168, 2025-05-07T20:33:00.7160138Z scale_ub=None, 2025-05-07T20:33:00.7160221Z contiguous=False, 2025-05-07T20:33:00.7160308Z compiled=True, 2025-05-07T20:33:00.7160379Z ) 2025-05-07T20:33:00.7160599Z self = 2025-05-07T20:33:00.7160779Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.7160786Z 2025-05-07T20:33:00.7160860Z @given( 2025-05-07T20:33:00.7160977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7161073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7161185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7161304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7161418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7161489Z ) 2025-05-07T20:33:00.7161741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7161835Z def test_silu_mul_quant( 2025-05-07T20:33:00.7161908Z self, 2025-05-07T20:33:00.7162048Z T: int, 2025-05-07T20:33:00.7162123Z D: int, 2025-05-07T20:33:00.7162225Z scale_ub: Optional[float], 2025-05-07T20:33:00.7162314Z contiguous: bool, 2025-05-07T20:33:00.7162399Z compiled: bool, 2025-05-07T20:33:00.7162478Z ) -> None: 2025-05-07T20:33:00.7162575Z torch.manual_seed(2025) 2025-05-07T20:33:00.7162645Z 2025-05-07T20:33:00.7162816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7164649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7164712Z 2025-05-07T20:33:00.7164835Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7164842Z 2025-05-07T20:33:00.7164945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7165205Z self=, 2025-05-07T20:33:00.7165284Z T=4096, 2025-05-07T20:33:00.7165358Z D=7168, 2025-05-07T20:33:00.7165439Z scale_ub=None, 2025-05-07T20:33:00.7165528Z contiguous=True, 2025-05-07T20:33:00.7165610Z compiled=False, 2025-05-07T20:33:00.7165685Z ) 2025-05-07T20:33:00.7165907Z self = 2025-05-07T20:33:00.7166078Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.7166082Z 2025-05-07T20:33:00.7166159Z @given( 2025-05-07T20:33:00.7166276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7166375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7166489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7166605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7166760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7166834Z ) 2025-05-07T20:33:00.7167081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7167178Z def test_silu_mul_quant( 2025-05-07T20:33:00.7167252Z self, 2025-05-07T20:33:00.7167327Z T: int, 2025-05-07T20:33:00.7167407Z D: int, 2025-05-07T20:33:00.7167507Z scale_ub: Optional[float], 2025-05-07T20:33:00.7167594Z contiguous: bool, 2025-05-07T20:33:00.7167681Z compiled: bool, 2025-05-07T20:33:00.7167755Z ) -> None: 2025-05-07T20:33:00.7167849Z torch.manual_seed(2025) 2025-05-07T20:33:00.7167922Z 2025-05-07T20:33:00.7168088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7169925Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7169936Z 2025-05-07T20:33:00.7170053Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7170057Z 2025-05-07T20:33:00.7170161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7170384Z self=, 2025-05-07T20:33:00.7170459Z T=16384, 2025-05-07T20:33:00.7170534Z D=7168, 2025-05-07T20:33:00.7170657Z scale_ub=None, 2025-05-07T20:33:00.7170741Z contiguous=True, 2025-05-07T20:33:00.7170825Z compiled=False, 2025-05-07T20:33:00.7170894Z ) 2025-05-07T20:33:00.7171116Z self = 2025-05-07T20:33:00.7171295Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.7171300Z 2025-05-07T20:33:00.7171374Z @given( 2025-05-07T20:33:00.7171491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7171587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7171700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7171860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7171972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7172043Z ) 2025-05-07T20:33:00.7172293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7172385Z def test_silu_mul_quant( 2025-05-07T20:33:00.7172461Z self, 2025-05-07T20:33:00.7172538Z T: int, 2025-05-07T20:33:00.7172611Z D: int, 2025-05-07T20:33:00.7172707Z scale_ub: Optional[float], 2025-05-07T20:33:00.7172800Z contiguous: bool, 2025-05-07T20:33:00.7172923Z compiled: bool, 2025-05-07T20:33:00.7173004Z ) -> None: 2025-05-07T20:33:00.7173100Z torch.manual_seed(2025) 2025-05-07T20:33:00.7173171Z 2025-05-07T20:33:00.7173342Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7175229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7175238Z 2025-05-07T20:33:00.7175395Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7175400Z 2025-05-07T20:33:00.7175503Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7175726Z self=, 2025-05-07T20:33:00.7175805Z T=16384, 2025-05-07T20:33:00.7175879Z D=7168, 2025-05-07T20:33:00.7175960Z scale_ub=1200.0, 2025-05-07T20:33:00.7176048Z contiguous=True, 2025-05-07T20:33:00.7176135Z compiled=False, 2025-05-07T20:33:00.7176209Z ) 2025-05-07T20:33:00.7176427Z self = 2025-05-07T20:33:00.7176603Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.7176608Z 2025-05-07T20:33:00.7176684Z @given( 2025-05-07T20:33:00.7176799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7176896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7177013Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7177130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7177246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7177320Z ) 2025-05-07T20:33:00.7177568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7177664Z def test_silu_mul_quant( 2025-05-07T20:33:00.7177736Z self, 2025-05-07T20:33:00.7177813Z T: int, 2025-05-07T20:33:00.7177889Z D: int, 2025-05-07T20:33:00.7177985Z scale_ub: Optional[float], 2025-05-07T20:33:00.7178167Z contiguous: bool, 2025-05-07T20:33:00.7178254Z compiled: bool, 2025-05-07T20:33:00.7178329Z ) -> None: 2025-05-07T20:33:00.7178423Z torch.manual_seed(2025) 2025-05-07T20:33:00.7178495Z 2025-05-07T20:33:00.7178709Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7180553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
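Every failure above is the initial torch.randn of a [T, 2 * D] bfloat16 tensor, and the requested sizes match T x 2D x 2 bytes exactly (e.g. T=4096, D=5120 gives 4096 x 10240 x 2 B = 80 MiB). Each message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that hint, assuming a fresh process; the variable is read when CUDA is first initialized, so it must be set before the first GPU allocation:

    import os

    # Allocator hint from the error message above; must be set before torch
    # initializes CUDA, i.e. before any GPU allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    # The 80 MiB case from the log: 4096 x (2 * 5120) bf16 elements.
    x = torch.randn([4096, 2 * 5120], device="cuda", dtype=torch.bfloat16)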
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

    (@given/@settings decorators as above)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fd504c45ea0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
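This CompilationError is an architecture mismatch rather than a bug in the kernel: Triton's fp8e4nv (FP8 E4M3) type appears to require compute capability (8, 9) or newer, while the A10G on this linux.g5.4xlarge runner reports (8, 6). A minimal sketch of a capability guard, assuming that threshold; the helper and decorator names are illustrative, not FBGEMM's actual gating:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv conversions need sm_89+ on this Triton build.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    skip_if_no_fp8 = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv not supported on this GPU architecture",
    )

Applied to test_silu_mul_quant, such a guard would skip rather than fail on pre-Ada GPUs like this one.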
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

    (test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile path as above: jit.py:330 -> jit.py:623 -> compiler.py:273)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/compiler/compiler.py:100: CompilationError
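Note that the compiled=True variant fails identically, just with torch._dynamo's eval_frame.py in the stack: torch.compile wraps the same op, which still launches the same Triton kernel, so it cannot mask the fp8e4nv requirement. For context, a minimal eager sketch of the pattern under test, assuming silu_mul_quant computes silu(x0) * x1 followed by FP8 quantization (an assumption; the kernel source is not in this log, and the quantization step is omitted here):

    import torch
    import torch.nn.functional as F

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Eager reference of the fused op's math: SiLU(x0) * x1, elementwise.
        return F.silu(x0) * x1

    # Compiling the reference changes the execution path, not the hardware
    # requirements of any FP8 kernel it would lower to.
    compiled_ref = torch.compile(silu_mul_ref)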
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

(failed the same way at moe/activation_test.py:95, allocating 20.00 MiB with 4.44 MiB free and 3.87 MiB reserved but unallocated)
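The failures have now moved from line 92 to line 95 (the torch.clamp temporary), and the free memory reported has shrunk from 26.44 MiB to 4.44 MiB while PyTorch's allocated memory grew from 21.73 to 21.77 GiB, which suggests allocations accumulate across Hypothesis examples within this single process. A sketch of per-example cleanup under that assumption; the helper is hypothetical and not part of the test file shown above:

    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead tensors still held via Python refs
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure frees complete before the next example

Called from the test's setUp/tearDown, this would bound growth from cached blocks, though it cannot help if live tensors are genuinely retained between examples.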
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.10/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
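To replay a specific failing case deterministically before the randomized draws, Hypothesis's example decorator pins inputs. A sketch mirroring the @given block from the log; the test name and placeholder body are illustrative, not the real test:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the case that hit the Triton CompilationError above.
    @example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant_repro(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        assert T > 0 and D > 0  # placeholder body; the real test is shown above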
See " 2025-05-07T20:33:00.7234465Z 2025-05-07T20:33:00.7234678Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:00.7234844Z ================= 1 failed, 1 deselected, 3 warnings in 17.43s ================= 2025-05-07T20:33:02.2625196Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:02.3240090Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:02.3240806Z 2025-05-07T20:33:02.3241328Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:02.3242967Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:02.3244145Z 2025-05-07T20:33:02.3244157Z 2025-05-07T20:33:02.3244168Z 2025-05-07T20:33:02.3261595Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:02.3342463Z Post job cleanup. 2025-05-07T20:33:02.4351341Z [command]/usr/bin/git version 2025-05-07T20:33:02.4391472Z git version 2.47.1 2025-05-07T20:33:02.4430217Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/19d6b077-0aa3-409a-923a-d25e4232a9ba/.gitconfig' 2025-05-07T20:33:02.4440933Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/19d6b077-0aa3-409a-923a-d25e4232a9ba' before making global git config changes 2025-05-07T20:33:02.4442322Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:02.4456679Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:02.4500041Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:02.4535464Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:02.4871209Z Entering 'external/asmjit' 2025-05-07T20:33:02.4937313Z Entering 'external/composable_kernel' 2025-05-07T20:33:02.5011871Z Entering 'external/cpuinfo' 2025-05-07T20:33:02.5079113Z Entering 'external/cutlass' 2025-05-07T20:33:02.5152395Z Entering 'external/googletest' 2025-05-07T20:33:02.5219127Z Entering 'external/hipify_torch' 2025-05-07T20:33:02.5285881Z Entering 'external/json' 2025-05-07T20:33:02.5371279Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:02.5396959Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5408667Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:02.5439913Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:02.5768668Z Entering 'external/asmjit' 2025-05-07T20:33:02.5810077Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5852974Z Entering 'external/composable_kernel' 2025-05-07T20:33:02.5896354Z http.https://github.com/.extraheader 2025-05-07T20:33:02.5948393Z Entering 'external/cpuinfo' 2025-05-07T20:33:02.5990602Z http.https://github.com/.extraheader 2025-05-07T20:33:02.6033418Z Entering 'external/cutlass' 2025-05-07T20:33:02.6077460Z http.https://github.com/.extraheader 2025-05-07T20:33:02.6128565Z 
2025-05-07T20:33:02.6128565Z Entering 'external/googletest'
2025-05-07T20:33:02.6172045Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6215365Z Entering 'external/hipify_torch'
2025-05-07T20:33:02.6259270Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6301682Z Entering 'external/json'
2025-05-07T20:33:02.6344543Z http.https://github.com/.extraheader
2025-05-07T20:33:02.6501325Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:02.6532487Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:02.6543019Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:02.6543380Z ##[endgroup]
2025-05-07T20:33:02.6643709Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:13.4143563Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:29.8221116Z Cleaning up orphan processes
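For reference, the retry the harness ran above (`conda run python -m pytest ... --lf --last-failed-no-failures none`) can be reproduced locally through pytest's Python entry point; a sketch, assuming it is run from the test directory:

    import sys

    import pytest

    exit_code = pytest.main([
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--lf",                               # rerun only the last failed tests
        "--last-failed-no-failures", "none",  # run nothing if none failed before
        "./moe/activation_test.py",
    ])
    sys.exit(exit_code)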