2025-05-07T20:22:34.9474251Z Current runner version: '2.323.0'
2025-05-07T20:22:34.9480282Z Runner name: 'i-00cb9561c833cfdb2'
2025-05-07T20:22:34.9481172Z Machine name: 'ip-10-0-73-154'
2025-05-07T20:22:34.9483974Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:22:34.9486239Z Contents: read
2025-05-07T20:22:34.9486747Z Metadata: read
2025-05-07T20:22:34.9487238Z Packages: read
2025-05-07T20:22:34.9487715Z ##[endgroup]
2025-05-07T20:22:34.9489588Z Secret source: None
2025-05-07T20:22:34.9490273Z Prepare workflow directory
2025-05-07T20:22:35.0007182Z Prepare all required actions
2025-05-07T20:22:35.0044849Z Getting action download info
2025-05-07T20:22:35.2235599Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:22:35.5364156Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:22:35.9021723Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:22:37.5121342Z Getting action download info
2025-05-07T20:22:37.6335534Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:22:37.8306913Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.9, 12.6.3, 12.6.3, gcc)
2025-05-07T20:22:37.8913616Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:22:37.9048386Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:22:37.9061272Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:22:37.9062810Z ##[endgroup]
2025-05-07T20:22:38.9744665Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:22:38.9745381Z Instance Type: g5.4xlarge
2025-05-07T20:22:38.9745728Z AMI Name: unknown
2025-05-07T20:22:38.9784632Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:22:44.3224232Z ##[group]Run actions/checkout@v4
2025-05-07T20:22:44.3224548Z with:
2025-05-07T20:22:44.3224778Z   submodules: true
2025-05-07T20:22:44.3225017Z   repository: pytorch/FBGEMM
2025-05-07T20:22:44.3225418Z   token: ***
2025-05-07T20:22:44.3225619Z   ssh-strict: true
2025-05-07T20:22:44.3225834Z   ssh-user: git
2025-05-07T20:22:44.3226052Z   persist-credentials: true
2025-05-07T20:22:44.3226304Z   clean: true
2025-05-07T20:22:44.3226533Z   sparse-checkout-cone-mode: true
2025-05-07T20:22:44.3226798Z   fetch-depth: 1
2025-05-07T20:22:44.3227014Z   fetch-tags: false
2025-05-07T20:22:44.3227232Z   show-progress: true
2025-05-07T20:22:44.3227458Z   lfs: false
2025-05-07T20:22:44.3227665Z   set-safe-directory: true
2025-05-07T20:22:44.3227923Z env:
2025-05-07T20:22:44.3228138Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:44.3228454Z   BUILD_ENV: build_binary
2025-05-07T20:22:44.3228731Z   BUILD_TARGET: genai
2025-05-07T20:22:44.3228970Z   BUILD_VARIANT: cuda
2025-05-07T20:22:44.3229241Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:44.3229498Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:44.3229843Z ##[endgroup]
2025-05-07T20:22:44.4384315Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:22:44.4385496Z ##[group]Getting Git version info
2025-05-07T20:22:44.4385946Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4386548Z [command]/usr/bin/git version
2025-05-07T20:22:44.4386812Z git version 2.47.1
2025-05-07T20:22:44.4395353Z ##[endgroup]
2025-05-07T20:22:44.4409287Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2a7e0901-7173-4864-9b4d-c594ce024a59' before making global git config changes
2025-05-07T20:22:44.4410299Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:22:44.4424281Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4461167Z Deleting the contents of '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:22:44.4464497Z ##[group]Initializing the repository
2025-05-07T20:22:44.4468684Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:44.4511712Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-05-07T20:22:44.4512380Z hint: is subject to change. To configure the initial branch name to use in all
2025-05-07T20:22:44.4512907Z hint: of your new repositories, which will suppress this warning, call:
2025-05-07T20:22:44.4513273Z hint:
2025-05-07T20:22:44.4513563Z hint:   git config --global init.defaultBranch <name>
2025-05-07T20:22:44.4513960Z hint:
2025-05-07T20:22:44.4514321Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-05-07T20:22:44.4514911Z hint: 'development'. The just-created branch can be renamed via this command:
2025-05-07T20:22:44.4515330Z hint:
2025-05-07T20:22:44.4515567Z hint:   git branch -m <name>
2025-05-07T20:22:44.4516040Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/
2025-05-07T20:22:44.4524627Z [command]/usr/bin/git remote add origin https://github.com/pytorch/FBGEMM
2025-05-07T20:22:44.4558784Z ##[endgroup]
2025-05-07T20:22:44.4559234Z ##[group]Disabling automatic garbage collection
2025-05-07T20:22:44.4562977Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:22:44.4594318Z ##[endgroup]
2025-05-07T20:22:44.4594696Z ##[group]Setting up auth
2025-05-07T20:22:44.4601195Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:22:44.4633725Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:22:44.4994686Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:22:44.5027328Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:22:44.5375614Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:44.5424752Z ##[endgroup]
2025-05-07T20:22:44.5425152Z ##[group]Fetching the repository
2025-05-07T20:22:44.5432982Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:22:45.2977236Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:22:45.2977904Z  * [new ref]  a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:22:45.3001717Z ##[endgroup]
2025-05-07T20:22:45.3002130Z ##[group]Determining the checkout info
2025-05-07T20:22:45.3005051Z ##[endgroup]
2025-05-07T20:22:45.3020685Z [command]/usr/bin/git sparse-checkout disable
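
The auth setup above scrubs any stale core.sshCommand and extraheader entries from the repository and its submodules before injecting a fresh basic-auth header for github.com. A minimal sketch of the same pattern outside the runner (the base64 token value is a placeholder; the real one is masked as *** in this log):

  git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic <BASE64_TOKEN>"  # placeholder credential
  git config --global --add url.https://github.com/.insteadOf "git@github.com:"                  # reroute SSH-style URLs over HTTPS so the header applies
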
2025-05-07T20:22:45.3059157Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:22:45.3098953Z ##[group]Checking out the ref
2025-05-07T20:22:45.3102345Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:22:45.4172958Z Note: switching to 'refs/remotes/pull/4066/merge'.
2025-05-07T20:22:45.4173487Z
2025-05-07T20:22:45.4173967Z You are in 'detached HEAD' state. You can look around, make experimental
2025-05-07T20:22:45.4175022Z changes and commit them, and you can discard any commits you make in this
2025-05-07T20:22:45.4175564Z state without impacting any branches by switching back to a branch.
2025-05-07T20:22:45.4175878Z
2025-05-07T20:22:45.4176086Z If you want to create a new branch to retain commits you create, you may
2025-05-07T20:22:45.4176546Z do so (now or later) by using -c with the switch command. Example:
2025-05-07T20:22:45.4176807Z
2025-05-07T20:22:45.4176916Z   git switch -c <new-branch-name>
2025-05-07T20:22:45.4177108Z
2025-05-07T20:22:45.4177234Z Or undo this operation with:
2025-05-07T20:22:45.4177405Z
2025-05-07T20:22:45.4177491Z   git switch -
2025-05-07T20:22:45.4177901Z
2025-05-07T20:22:45.4178138Z Turn off this advice by setting config variable advice.detachedHead to false
2025-05-07T20:22:45.4178460Z
2025-05-07T20:22:45.4178837Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:22:45.4184987Z ##[endgroup]
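
The checkout above can be reproduced by hand; the commands below are taken verbatim from this log (PR merge commit a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 fetched into refs/remotes/pull/4066/merge):

  git init FBGEMM && cd FBGEMM
  git remote add origin https://github.com/pytorch/FBGEMM
  git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
  git checkout --progress --force refs/remotes/pull/4066/merge   # lands in detached HEAD, as git warns above
  git -c protocol.version=2 submodule update --init --force --depth=1
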
2025-05-07T20:22:45.4185386Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:22:45.4190556Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:22:45.4234742Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:22:45.4267677Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:22:45.4300704Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:22:45.4329604Z ##[endgroup]
2025-05-07T20:22:45.4329977Z ##[group]Fetching submodules
2025-05-07T20:22:45.4332465Z [command]/usr/bin/git submodule sync
2025-05-07T20:22:45.4678933Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:22:45.5010105Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit'
2025-05-07T20:22:45.5012342Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel'
2025-05-07T20:22:45.5015452Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo'
2025-05-07T20:22:45.5018745Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass'
2025-05-07T20:22:45.5022177Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest'
2025-05-07T20:22:45.5026600Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch'
2025-05-07T20:22:45.5029821Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json'
2025-05-07T20:22:45.5060511Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/asmjit'...
2025-05-07T20:22:45.8030356Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/composable_kernel'...
2025-05-07T20:22:46.2894699Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cpuinfo'...
2025-05-07T20:22:46.6255771Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/cutlass'...
2025-05-07T20:22:47.7416867Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/googletest'...
2025-05-07T20:22:48.0659236Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/hipify_torch'...
2025-05-07T20:22:48.4136672Z Cloning into '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/external/json'...
2025-05-07T20:22:49.5516381Z From https://github.com/asmjit/asmjit
2025-05-07T20:22:49.5516843Z  * branch  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD
2025-05-07T20:22:49.5995950Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:22:51.0981850Z From https://github.com/jwfromm/composable_kernel
2025-05-07T20:22:51.0982331Z  * branch  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD
2025-05-07T20:22:51.3821398Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:22:52.1643157Z From https://github.com/pytorch/cpuinfo
2025-05-07T20:22:52.1643696Z  * branch  6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD
2025-05-07T20:22:52.2749983Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:22:53.3783687Z From https://github.com/jwfromm/cutlass
2025-05-07T20:22:53.3784545Z  * branch  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD
2025-05-07T20:22:54.0813178Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:22:54.9072379Z From https://github.com/google/googletest
2025-05-07T20:22:54.9072837Z  * branch  f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD
2025-05-07T20:22:54.9482534Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:22:55.6570347Z From https://github.com/ROCmSoftwarePlatform/hipify_torch
2025-05-07T20:22:55.6570834Z  * branch  420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD
2025-05-07T20:22:55.6659493Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:22:56.4234461Z From https://github.com/nlohmann/json
2025-05-07T20:22:56.4234885Z  * branch  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD
2025-05-07T20:22:56.5352660Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:22:56.5374175Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:22:56.5707399Z Entering 'external/asmjit'
2025-05-07T20:22:56.5740424Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.5773230Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.5805657Z Entering 'external/cutlass'
2025-05-07T20:22:56.5839586Z Entering 'external/googletest'
2025-05-07T20:22:56.5872450Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.5904937Z Entering 'external/json'
2025-05-07T20:22:56.5953037Z ##[endgroup]
2025-05-07T20:22:56.5953431Z ##[group]Persisting credentials for submodules
2025-05-07T20:22:56.5959682Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:22:56.6302747Z Entering 'external/asmjit'
2025-05-07T20:22:56.6374979Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.6444845Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.6514751Z Entering 'external/cutlass'
2025-05-07T20:22:56.6588979Z Entering 'external/googletest'
2025-05-07T20:22:56.6658699Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.6728867Z Entering 'external/json'
2025-05-07T20:22:56.6814940Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:22:56.7144501Z Entering 'external/asmjit'
2025-05-07T20:22:56.7207474Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:22:56.7210408Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.7273557Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:22:56.7275838Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.7336847Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:22:56.7341021Z Entering 'external/cutlass'
2025-05-07T20:22:56.7401882Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:22:56.7405609Z Entering 'external/googletest'
2025-05-07T20:22:56.7466259Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:22:56.7467643Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.7530339Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:22:56.7532983Z Entering 'external/json'
2025-05-07T20:22:56.7596384Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:22:56.7711222Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:22:56.8036981Z Entering 'external/asmjit'
2025-05-07T20:22:56.8069035Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.8101484Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.8133316Z Entering 'external/cutlass'
2025-05-07T20:22:56.8164278Z Entering 'external/googletest'
2025-05-07T20:22:56.8195866Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.8227879Z Entering 'external/json'
2025-05-07T20:22:56.8274736Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:22:56.8600140Z Entering 'external/asmjit'
2025-05-07T20:22:56.8631847Z Entering 'external/composable_kernel'
2025-05-07T20:22:56.8663734Z Entering 'external/cpuinfo'
2025-05-07T20:22:56.8695046Z Entering 'external/cutlass'
2025-05-07T20:22:56.8726626Z Entering 'external/googletest'
2025-05-07T20:22:56.8757754Z Entering 'external/hipify_torch'
2025-05-07T20:22:56.8789744Z Entering 'external/json'
2025-05-07T20:22:56.8850169Z ##[endgroup]
2025-05-07T20:22:56.8875729Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:22:56.8902747Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:22:56.9095721Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:22:56.9096040Z with:
2025-05-07T20:22:56.9096280Z   name: fbgemm_genai_x86_gcc_py3.9_cu12.6.3.whl
2025-05-07T20:22:56.9096598Z   merge-multiple: false
2025-05-07T20:22:56.9096844Z   repository: pytorch/FBGEMM
2025-05-07T20:22:56.9097095Z   run-id: 14891846252
2025-05-07T20:22:56.9097295Z env:
2025-05-07T20:22:56.9097513Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:56.9097800Z   BUILD_ENV: build_binary
2025-05-07T20:22:56.9098040Z   BUILD_TARGET: genai
2025-05-07T20:22:56.9098254Z   BUILD_VARIANT: cuda
2025-05-07T20:22:56.9098487Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:56.9098727Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:56.9098951Z ##[endgroup]
2025-05-07T20:22:57.1409049Z Downloading single artifact
2025-05-07T20:22:57.2443255Z Preparing to download the following artifacts:
2025-05-07T20:22:57.2444069Z - fbgemm_genai_x86_gcc_py3.9_cu12.6.3.whl (ID: 3081362189, Size: 12502543, Expected Digest: sha256:b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086)
2025-05-07T20:22:57.2877768Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-e6155e83-5447-52ac-883e-059201805a6b/artifacts/563b5055f9a6d043e54aa78b8ff41f43d378d41ef31414d516f024e334c8085c.zip
2025-05-07T20:22:57.2879172Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:22:57.3490090Z (node:56972) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:22:57.3491030Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:22:57.5340175Z SHA256 digest of downloaded artifact is b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086
2025-05-07T20:22:57.5340976Z Artifact download completed successfully.
2025-05-07T20:22:57.5341308Z Total of 1 artifact(s) downloaded
2025-05-07T20:22:57.5346128Z Download artifact has finished successfully
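
download-artifact validates the blob against the expected digest shown above. The same check can be repeated by hand; a sketch, assuming the downloaded archive was saved locally as artifact.zip (hypothetical filename):

  sha256sum artifact.zip
  # expected: b7fa57ec448da168df38f5dcb4b2b6b212acdb815f6a27a8982ddcd3bf673086
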
2025-05-07T20:22:57.5605930Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:22:57.5606312Z with:
2025-05-07T20:22:57.5606529Z   driver-version: 570.133.07
2025-05-07T20:22:57.5606794Z env:
2025-05-07T20:22:57.5607010Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:22:57.5607312Z   BUILD_ENV: build_binary
2025-05-07T20:22:57.5607565Z   BUILD_TARGET: genai
2025-05-07T20:22:57.5607790Z   BUILD_VARIANT: cuda
2025-05-07T20:22:57.5608035Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:22:57.5608294Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:22:57.5608524Z ##[endgroup]
2025-05-07T20:22:57.5700611Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:22:57.5700995Z with:
2025-05-07T20:22:57.5701191Z   timeout_minutes: 10
2025-05-07T20:22:57.5701605Z   max_attempts: 3
2025-05-07T20:22:57.5724856Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }
if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! | # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? 
# Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with # more than one GPUs. This just needs to be run once. The command fails # on subsequent runs and complains that the mode is already on, but that's # ok sudo nvidia-persistenced || true # This should show persistence mode ON nvidia-smi 2025-05-07T20:22:57.5748113Z retry_wait_seconds: 10 2025-05-07T20:22:57.5748392Z polling_interval_seconds: 1 2025-05-07T20:22:57.5748660Z warning_on_retry: true 2025-05-07T20:22:57.5748908Z continue_on_error: false 2025-05-07T20:22:57.5749147Z env: 2025-05-07T20:22:57.5749363Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:22:57.5749713Z BUILD_ENV: build_binary 2025-05-07T20:22:57.5749949Z BUILD_TARGET: genai 2025-05-07T20:22:57.5750170Z BUILD_VARIANT: cuda 2025-05-07T20:22:57.5750408Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:22:57.5750663Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:22:57.5750906Z DRIVER_VERSION: 570.133.07 2025-05-07T20:22:57.5751150Z ##[endgroup] 2025-05-07T20:22:57.6558649Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run == 2025-05-07T20:22:57.6560240Z + pre_install_nvidia_driver_amzn2 2025-05-07T20:22:57.6560646Z + sudo yum remove -y nvidia-driver-latest-dkms 2025-05-07T20:22:58.2912457Z No match for argument: nvidia-driver-latest-dkms 2025-05-07T20:22:58.2913312Z No packages marked for removal. 2025-05-07T20:22:58.2978141Z Dependencies resolved. 2025-05-07T20:22:58.2988966Z Nothing to do. 2025-05-07T20:22:58.2990560Z Complete! 2025-05-07T20:22:58.3314640Z + install_nvidia_driver_common 2025-05-07T20:22:58.3318391Z + echo 'Before installing NVIDIA driver' 2025-05-07T20:22:58.3318714Z + lspci 2025-05-07T20:22:58.3320393Z Before installing NVIDIA driver 2025-05-07T20:22:58.3506930Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:22:58.3507689Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:22:58.3508249Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:22:58.3508765Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:22:58.3509229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:22:58.3509819Z 00:05.0 Ethernet controller: Amazon.com, Inc. 
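
nick-fields/retry wraps the inline script with the budget configured above. A rough plain-bash equivalent of those settings, not the action's actual implementation (10-minute timeout per attempt, 3 attempts, 10 s between retries; setup_nvidia.sh is a hypothetical file holding the command above):

  for attempt in 1 2 3; do
    timeout 600 bash setup_nvidia.sh && break       # 10-minute cap per attempt
    [ "$attempt" -eq 3 ] && exit 1                  # give up after max_attempts
    echo "WARNING: attempt ${attempt} failed, retrying in 10s"
    sleep 10
  done
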
2025-05-07T20:22:57.6558649Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:22:57.6560240Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:22:57.6560646Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:22:58.2912457Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:22:58.2913312Z No packages marked for removal.
2025-05-07T20:22:58.2978141Z Dependencies resolved.
2025-05-07T20:22:58.2988966Z Nothing to do.
2025-05-07T20:22:58.2990560Z Complete!
2025-05-07T20:22:58.3314640Z + install_nvidia_driver_common
2025-05-07T20:22:58.3318391Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:22:58.3318714Z + lspci
2025-05-07T20:22:58.3320393Z Before installing NVIDIA driver
2025-05-07T20:22:58.3506930Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:22:58.3507689Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:22:58.3508249Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:22:58.3508765Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:22:58.3509229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:22:58.3509819Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:22:58.3510300Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:22:58.3510776Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:22:58.3511163Z + lsmod
2025-05-07T20:22:58.3551101Z Module Size Used by
2025-05-07T20:22:58.3551399Z xt_conntrack 16384 1
2025-05-07T20:22:58.3551674Z nft_chain_nat 16384 3
2025-05-07T20:22:58.3551943Z xt_MASQUERADE 20480 1
2025-05-07T20:22:58.3552249Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:22:58.3552584Z nf_conntrack_netlink 57344 0
2025-05-07T20:22:58.3552977Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:22:58.3553413Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:22:58.3553738Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:22:58.3554027Z xfrm_user 57344 1
2025-05-07T20:22:58.3554299Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:22:58.3554588Z xt_addrtype 16384 2
2025-05-07T20:22:58.3554874Z nft_compat 20480 4
2025-05-07T20:22:58.3555174Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:22:58.3555584Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:22:58.3555966Z br_netfilter 36864 0
2025-05-07T20:22:58.3556238Z bridge 323584 1 br_netfilter
2025-05-07T20:22:58.3556540Z stp 16384 1 bridge
2025-05-07T20:22:58.3556825Z llc 16384 2 bridge,stp
2025-05-07T20:22:58.3557103Z overlay 167936 0
2025-05-07T20:22:58.3557355Z tls 135168 0
2025-05-07T20:22:58.3557613Z nls_ascii 16384 1
2025-05-07T20:22:58.3557905Z nls_cp437 20480 1
2025-05-07T20:22:58.3558156Z vfat 24576 1
2025-05-07T20:22:58.3558414Z fat 86016 1 vfat
2025-05-07T20:22:58.3558688Z sunrpc 696320 1
2025-05-07T20:22:58.3558935Z ena 180224 0
2025-05-07T20:22:58.3559184Z i8042 45056 0
2025-05-07T20:22:58.3559439Z serio 28672 3 i8042
2025-05-07T20:22:58.3559710Z ghash_clmulni_intel 16384 0
2025-05-07T20:22:58.3559985Z button 24576 0
2025-05-07T20:22:58.3560238Z sch_fq_codel 20480 17
2025-05-07T20:22:58.3560488Z fuse 163840 1
2025-05-07T20:22:58.3560734Z dm_mod 188416 0
2025-05-07T20:22:58.3560992Z configfs 57344 1
2025-05-07T20:22:58.3561239Z dax 45056 1 dm_mod
2025-05-07T20:22:58.3561509Z loop 36864 0
2025-05-07T20:22:58.3561757Z dmi_sysfs 20480 0
2025-05-07T20:22:58.3561997Z crc32_pclmul 16384 0
2025-05-07T20:22:58.3562254Z crc32c_intel 24576 0
2025-05-07T20:22:58.3562503Z efivarfs 24576 1
2025-05-07T20:22:58.3562742Z + modinfo nvidia
2025-05-07T20:22:58.3570136Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:22:58.3570612Z import_ns: DMA_BUF
2025-05-07T20:22:58.3570860Z alias: char-major-195-*
2025-05-07T20:22:58.3571119Z version: 570.133.07
2025-05-07T20:22:58.3571372Z supported: external
2025-05-07T20:22:58.3571751Z license: Dual MIT/GPL
2025-05-07T20:22:58.3572075Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:22:58.3572410Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:22:58.3572847Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:22:58.3573172Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:22:58.3573514Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:22:58.3573843Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:22:58.3574155Z depends: i2c-core,drm
2025-05-07T20:22:58.3574417Z retpoline: Y
2025-05-07T20:22:58.3574629Z name: nvidia
2025-05-07T20:22:58.3574992Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:22:58.3575468Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:22:58.3575901Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:22:58.3576402Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:22:58.3576712Z parm: NVreg_RmLogonRC:int
2025-05-07T20:22:58.3577027Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:22:58.3577343Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:22:58.3577645Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:22:58.3577947Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:22:58.3578302Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:22:58.3578687Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:22:58.3579018Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:22:58.3579312Z parm: NVreg_EnableMSI:int
2025-05-07T20:22:58.3579622Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:22:58.3579989Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:22:58.3580385Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:22:58.3580759Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:22:58.3581178Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3581584Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:22:58.3582005Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:22:58.3582419Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:22:58.3582758Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:22:58.3583123Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:22:58.3583497Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:22:58.3583838Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:22:58.3584168Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:22:58.3584499Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:22:58.3584822Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:22:58.3585136Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:22:58.3585478Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:22:58.3585842Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:22:58.3586171Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:22:58.3586502Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:22:58.3586852Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:22:58.3587191Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:22:58.3587527Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:22:58.3587858Z parm: NVreg_RmMsg:charp
2025-05-07T20:22:58.3588151Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:22:58.3588475Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:22:58.3588796Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:22:58.3589113Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:22:58.3589442Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:22:58.3589850Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:22:58.3590201Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:22:58.3590539Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:22:58.3590879Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:22:58.3591219Z parm: rm_firmware_active:charp
2025-05-07T20:22:58.3591608Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:22:58.3591853Z ++ command -v nvidia-smi
2025-05-07T20:22:58.3592103Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:22:58.3592364Z + set +e
2025-05-07T20:22:58.3592674Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:00.1692137Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:00.1692512Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:00.1692766Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:00.1692987Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:00.1693246Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:00.1693672Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:00.1694137Z + set -e
2025-05-07T20:23:00.1694941Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:00.1695328Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
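
Since a matching driver was already present, the install path was skipped entirely. The decisive check, runnable by hand (expected value taken from this log):

  modinfo -F version nvidia                                           # kernel module: 570.133.07
  nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0  # userspace: must equal $DRIVER_VERSION, or the installer runs
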
2025-05-07T20:23:00.1695784Z + post_install_nvidia_driver_common
2025-05-07T20:23:00.1698111Z + sudo modprobe nvidia
2025-05-07T20:23:00.2695862Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:00.2696204Z + lspci
2025-05-07T20:23:00.2696428Z After installing NVIDIA driver
2025-05-07T20:23:00.2812019Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:00.2812525Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:00.2813075Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:00.2813601Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:00.2814072Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:00.2814595Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:00.2815103Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:00.2815570Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:00.2815976Z + lsmod
2025-05-07T20:23:00.2845682Z Module Size Used by
2025-05-07T20:23:00.2846014Z nvidia_uvm 1884160 0
2025-05-07T20:23:00.2846280Z nvidia 11583488 1 nvidia_uvm
2025-05-07T20:23:00.2846568Z drm 602112 1 nvidia
2025-05-07T20:23:00.2846873Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:00.2847177Z backlight 24576 1 drm
2025-05-07T20:23:00.2847463Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:00.2847751Z xt_conntrack 16384 1
2025-05-07T20:23:00.2848009Z nft_chain_nat 16384 3
2025-05-07T20:23:00.2848271Z xt_MASQUERADE 20480 1
2025-05-07T20:23:00.2848625Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:00.2848964Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:00.2849356Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:00.2849796Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:00.2850123Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:00.2850410Z xfrm_user 57344 1
2025-05-07T20:23:00.2850681Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:00.2850981Z xt_addrtype 16384 2
2025-05-07T20:23:00.2851243Z nft_compat 20480 4
2025-05-07T20:23:00.2851545Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:00.2851960Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:00.2852337Z br_netfilter 36864 0
2025-05-07T20:23:00.2852625Z bridge 323584 1 br_netfilter
2025-05-07T20:23:00.2852916Z stp 16384 1 bridge
2025-05-07T20:23:00.2853200Z llc 16384 2 bridge,stp
2025-05-07T20:23:00.2853489Z overlay 167936 0
2025-05-07T20:23:00.2853736Z tls 135168 0
2025-05-07T20:23:00.2853985Z nls_ascii 16384 1
2025-05-07T20:23:00.2854488Z nls_cp437 20480 1
2025-05-07T20:23:00.2854743Z vfat 24576 1
2025-05-07T20:23:00.2854993Z fat 86016 1 vfat
2025-05-07T20:23:00.2855259Z sunrpc 696320 1
2025-05-07T20:23:00.2855506Z ena 180224 0
2025-05-07T20:23:00.2855754Z i8042 45056 0
2025-05-07T20:23:00.2856004Z serio 28672 3 i8042
2025-05-07T20:23:00.2856274Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:00.2856536Z button 24576 0
2025-05-07T20:23:00.2856791Z sch_fq_codel 20480 17
2025-05-07T20:23:00.2857049Z fuse 163840 1
2025-05-07T20:23:00.2857293Z dm_mod 188416 0
2025-05-07T20:23:00.2857542Z configfs 57344 1
2025-05-07T20:23:00.2857972Z dax 45056 1 dm_mod
2025-05-07T20:23:00.2858253Z loop 36864 0
2025-05-07T20:23:00.2858506Z dmi_sysfs 20480 0
2025-05-07T20:23:00.2858759Z crc32_pclmul 16384 0
2025-05-07T20:23:00.2859012Z crc32c_intel 24576 0
2025-05-07T20:23:00.2859262Z efivarfs 24576 1
2025-05-07T20:23:00.2859508Z + modinfo nvidia
2025-05-07T20:23:00.2862468Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:00.2862937Z import_ns: DMA_BUF
2025-05-07T20:23:00.2863189Z alias: char-major-195-*
2025-05-07T20:23:00.2863454Z version: 570.133.07
2025-05-07T20:23:00.2863700Z supported: external
2025-05-07T20:23:00.2863951Z license: Dual MIT/GPL
2025-05-07T20:23:00.2864244Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:00.2864577Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:00.2864898Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:00.2865220Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:00.2865562Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:00.2865893Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:00.2866208Z depends: i2c-core,drm
2025-05-07T20:23:00.2866469Z retpoline: Y
2025-05-07T20:23:00.2866679Z name: nvidia
2025-05-07T20:23:00.2867037Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:00.2867506Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:00.2867954Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:00.2868408Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:00.2868714Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:00.2869015Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:00.2869321Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:00.2869707Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:00.2870016Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:00.2870369Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:00.2870757Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:00.2871087Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:00.2871380Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:00.2871686Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:00.2872043Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:00.2872427Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:00.2872802Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:00.2873216Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2873621Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:00.2874029Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:00.2874433Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:00.2874773Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:00.2875132Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:00.2875607Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:00.2875949Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:00.2876265Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:00.2876598Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:00.2876920Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:00.2877229Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:00.2877568Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:00.2877923Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:00.2878251Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:00.2878631Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:00.2878972Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:00.2879390Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:00.2879720Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:00.2880047Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:00.2880339Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:00.2880662Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:00.2880977Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:00.2881288Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:00.2881615Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:00.2881967Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:00.2882376Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:00.2882696Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:00.2883041Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:00.2883383Z parm: rm_firmware_active:charp
2025-05-07T20:23:00.2883661Z + set +e
2025-05-07T20:23:00.2883854Z + nvidia-smi
2025-05-07T20:23:01.6955431Z Wed May 7 20:23:01 2025
2025-05-07T20:23:01.6955815Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.6956402Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:01.6956969Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.6957452Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:01.6957973Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:01.6958400Z |                                         |                        |               MIG M. |
2025-05-07T20:23:01.6958726Z |=========================================+========================+======================|
2025-05-07T20:23:01.7020415Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:01.7021630Z |  0%   29C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:01.7022482Z |                                         |                        |                  N/A |
2025-05-07T20:23:01.7023350Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:01.7024110Z
2025-05-07T20:23:01.7024865Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:01.7025694Z | Processes:                                                                              |
2025-05-07T20:23:01.7026542Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:01.7027649Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:01.7028196Z |=========================================================================================|
2025-05-07T20:23:01.7028665Z |  No running processes found                                                             |
2025-05-07T20:23:01.7029367Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:02.1156283Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:03.5225822Z NVIDIA A10G
2025-05-07T20:23:03.7918963Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:03.7919251Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:03.7919590Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:03.7919909Z + set -e
2025-05-07T20:23:03.7920116Z INFO: Ignoring allowed status 0
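
The health gate above deliberately queries gpu_name rather than trusting the exit code of plain nvidia-smi, which can return 0 even after a driver crash (the ERR! case documented in the script's comments). Distilled from the script, with the tolerated status codes:

  nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
  NVIDIA_SMI_STATUS=$?
  # 0 is healthy; 14 is tolerated per https://github.com/NVIDIA/gpu-operator/issues/285
  if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
    exit "$NVIDIA_SMI_STATUS"
  fi
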
2025-05-07T20:23:03.7928332Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:03.7931444Z + sudo yum install -y yum-utils
2025-05-07T20:23:04.1903209Z Last metadata expiration check: 0:07:01 ago on Wed May 7 20:16:03 2025.
2025-05-07T20:23:04.2152286Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:04.2546013Z Dependencies resolved.
2025-05-07T20:23:04.2728531Z Nothing to do.
2025-05-07T20:23:04.2728881Z Complete!
2025-05-07T20:23:04.3114697Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:04.3115319Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.3116179Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.6623250Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:04.7179635Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:05.3279446Z nvidia-container-toolkit                         14 kB/s | 833 B  00:00
2025-05-07T20:23:05.3525216Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:05.3927019Z Dependencies resolved.
2025-05-07T20:23:05.4104638Z ================================================================================
2025-05-07T20:23:05.4105049Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:05.4105441Z ================================================================================
2025-05-07T20:23:05.4105747Z Downgrading:
2025-05-07T20:23:05.4106109Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:05.4106699Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:05.4107050Z
2025-05-07T20:23:05.4107138Z Transaction Summary
2025-05-07T20:23:05.4107387Z ================================================================================
2025-05-07T20:23:05.4107694Z Downgrade  2 Packages
2025-05-07T20:23:05.4107839Z
2025-05-07T20:23:05.4107960Z Total download size: 6.8 M
2025-05-07T20:23:05.4109868Z Downloading Packages:
2025-05-07T20:23:05.4588880Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  26 MB/s | 1.2 MB  00:00
2025-05-07T20:23:05.5291708Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  48 MB/s | 5.6 MB  00:00
2025-05-07T20:23:05.5299987Z --------------------------------------------------------------------------------
2025-05-07T20:23:05.5302954Z Total                                            57 MB/s | 6.8 MB  00:00
2025-05-07T20:23:05.5305967Z Running transaction check
2025-05-07T20:23:05.5407832Z Transaction check succeeded.
2025-05-07T20:23:05.5408110Z Running transaction test
2025-05-07T20:23:05.5705489Z Transaction test succeeded.
2025-05-07T20:23:05.5706989Z Running transaction
2025-05-07T20:23:06.1210901Z   Preparing        :                                                        1/1
2025-05-07T20:23:06.2279722Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:06.2317789Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.2518721Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:06.2519300Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.2633999Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:06.2671185Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:07.6782672Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:07.6783258Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:07.6783775Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:07.6784307Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:07.8147363Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:07.8148253Z WARNING:
2025-05-07T20:23:07.8148489Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:07.8148727Z
2025-05-07T20:23:07.8148817Z   Available Versions:
2025-05-07T20:23:07.8148969Z
2025-05-07T20:23:07.8149071Z   Version 2023.7.20250331:
2025-05-07T20:23:07.8149397Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:07.8149762Z
2025-05-07T20:23:07.8149893Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:07.8150097Z
2025-05-07T20:23:07.8150176Z     Release notes:
2025-05-07T20:23:07.8150584Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:07.8150949Z
2025-05-07T20:23:07.8151040Z   Version 2023.7.20250414:
2025-05-07T20:23:07.8151336Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:07.8151590Z
2025-05-07T20:23:07.8151702Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:07.8151907Z
2025-05-07T20:23:07.8151991Z     Release notes:
2025-05-07T20:23:07.8152384Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:07.8152744Z
2025-05-07T20:23:07.8152829Z   Version 2023.7.20250428:
2025-05-07T20:23:07.8153134Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:07.8153384Z
2025-05-07T20:23:07.8153494Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:07.8153695Z
2025-05-07T20:23:07.8153801Z     Release notes:
2025-05-07T20:23:07.8164917Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:07.8165299Z
2025-05-07T20:23:07.8165418Z ================================================================================
2025-05-07T20:23:07.8499196Z
2025-05-07T20:23:07.8499464Z
2025-05-07T20:23:07.8499551Z Downgraded:
2025-05-07T20:23:07.8499917Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:07.8500484Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:07.8500853Z
2025-05-07T20:23:07.8500939Z Complete!
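
Note that pinning nvidia-container-toolkit-1.16.2 resolved as a downgrade: the AMI shipped with 1.17.6 preinstalled, so yum replaced both the toolkit and its base package. The pin as a standalone command (verbatim from the script above):

  sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2   # resolves as a downgrade when 1.17.6 is already present
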
2025-05-07T20:23:07.8965599Z + sudo systemctl restart docker
2025-05-07T20:23:12.1351880Z Wed May 7 20:23:12 2025
2025-05-07T20:23:12.1352686Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.1353687Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:12.1354645Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:12.1355618Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:12.1356658Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:12.1357506Z |                                         |                        |               MIG M. |
2025-05-07T20:23:12.1358169Z |=========================================+========================+======================|
2025-05-07T20:23:12.1433958Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:12.1434724Z |  0%   29C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:12.1435111Z |                                         |                        |                  N/A |
2025-05-07T20:23:12.1435511Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:12.1435903Z
2025-05-07T20:23:12.1436296Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.1436726Z | Processes:                                                                              |
2025-05-07T20:23:12.1437169Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:12.1437720Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:12.1438071Z |=========================================================================================|
2025-05-07T20:23:12.1438719Z |  No running processes found                                                             |
2025-05-07T20:23:12.2898970Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:12.6307371Z Command completed after 1 attempt(s).
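
Persistence-M reads On here, versus Off in the earlier snapshot, because the script ran sudo nvidia-persistenced after the toolkit install. The GPU_FLAG it exported for later steps expands to a docker invocation along these lines (the image name is a placeholder):

  docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all <image> nvidia-smi
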
2025-05-07T20:23:12.6393850Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6394340Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:12.6409500Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:12.6409856Z env:
2025-05-07T20:23:12.6410076Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:12.6410390Z   BUILD_ENV: build_binary
2025-05-07T20:23:12.6410642Z   BUILD_TARGET: genai
2025-05-07T20:23:12.6410889Z   BUILD_VARIANT: cuda
2025-05-07T20:23:12.6411132Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:12.6411400Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:12.6411713Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.6412041Z ##[endgroup]
2025-05-07T20:23:12.9760133Z ################################################################################
2025-05-07T20:23:12.9760493Z # Print System Info
2025-05-07T20:23:12.9760711Z #
2025-05-07T20:23:12.9776808Z # [2025-05-07T20:23:12.977Z] + print_system_info
2025-05-07T20:23:12.9777272Z ################################################################################
2025-05-07T20:23:12.9777495Z
2025-05-07T20:23:12.9777606Z ################################################################################
2025-05-07T20:23:12.9777934Z [INFO] Printing environment variables ...
2025-05-07T20:23:12.9778222Z + printenv
2025-05-07T20:23:12.9778346Z
2025-05-07T20:23:12.9800732Z SHELL=/bin/bash
2025-05-07T20:23:12.9801250Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:12.9801796Z BUILD_VARIANT=cuda
2025-05-07T20:23:12.9802464Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9803267Z GITHUB_ACTION=__run
2025-05-07T20:23:12.9803656Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:12.9804316Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:12.9804638Z RUNNER_NAME=i-00cb9561c833cfdb2
2025-05-07T20:23:12.9804925Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:12.9805227Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:12.9805478Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:12.9805847Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:12.9806267Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:12.9806535Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:12.9806832Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:12.9807478Z ***
2025-05-07T20:23:12.9807690Z LOGNAME=ec2-user
2025-05-07T20:23:12.9807921Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:12.9808190Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:12.9808422Z GITHUB_ACTIONS=true
2025-05-07T20:23:12.9808637Z SYSTEMD_EXEC_PID=55529
2025-05-07T20:23:12.9808918Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:12.9809461Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:12.9809962Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:12.9810244Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:12.9810504Z RUNNER_OS=Linux
2025-05-07T20:23:12.9810744Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:12.9811019Z HOME=/home/ec2-user
2025-05-07T20:23:12.9811273Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:12.9811559Z LANG=C.UTF-8
2025-05-07T20:23:12.9811850Z RUNNER_TRACKING_ID=github_861545f2-750e-499f-bdda-e801da2ef5a8
2025-05-07T20:23:12.9812201Z RUNNER_ARCH=X64
2025-05-07T20:23:12.9812477Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:12.9813106Z BUILD_TARGET=genai
2025-05-07T20:23:12.9813632Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9814491Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9815210Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:12.9815874Z INVOCATION_ID=7ee3562b3fc14d84a45c0646162e5533
2025-05-07T20:23:12.9816194Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:12.9816457Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:12.9817021Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9817627Z BUILD_ENV=build_binary
2025-05-07T20:23:12.9817855Z GITHUB_ACTOR=q10
2025-05-07T20:23:12.9818063Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:12.9818286Z KERN_NAME_LC=linux
2025-05-07T20:23:12.9818514Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:12.9818810Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:12.9819150Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:12.9819454Z USER=ec2-user
2025-05-07T20:23:12.9819766Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:12.9820147Z SHLVL=1
2025-05-07T20:23:12.9820412Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:12.9820840Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:12.9821427Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:12.9821780Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:12.9822017Z KERN_NAME=Linux
2025-05-07T20:23:12.9822231Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:12.9822774Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:12.9823344Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:12.9823718Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:12.9823952Z JOURNAL_STREAM=8:91669
2025-05-07T20:23:12.9824264Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:12.9824622Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:12.9824922Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:12.9825253Z GITHUB_BASE_REF=main
2025-05-07T20:23:12.9825470Z CI=true
2025-05-07T20:23:12.9825666Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:12.9825954Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:12.9826232Z GITHUB_ACTION_REF=
2025-05-07T20:23:12.9826469Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:12.9827073Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_6c888332-cb40-41f2-a59e-2fe3ef0a577a
2025-05-07T20:23:12.9827654Z MACHINE_NAME=x86_64
2025-05-07T20:23:12.9827874Z _=/usr/bin/printenv
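
GITHUB_ENV, GITHUB_OUTPUT, GITHUB_PATH, and GITHUB_STEP_SUMMARY above are file-command paths: a step hands values to later steps by appending to them. That is exactly how GPU_FLAG entered this environment (line verbatim from the setup script earlier in this log):

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
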
2025-05-07T20:23:12.9820147Z SHLVL=1 2025-05-07T20:23:12.9820412Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:12.9820840Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:12.9821427Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:12.9821780Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:12.9822017Z KERN_NAME=Linux 2025-05-07T20:23:12.9822231Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:12.9822774Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:12.9823344Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:12.9823718Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:12.9823952Z JOURNAL_STREAM=8:91669 2025-05-07T20:23:12.9824264Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:12.9824622Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:12.9824922Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:12.9825253Z GITHUB_BASE_REF=main 2025-05-07T20:23:12.9825470Z CI=true 2025-05-07T20:23:12.9825666Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:12.9825954Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:12.9826232Z GITHUB_ACTION_REF= 2025-05-07T20:23:12.9826469Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:12.9827073Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_6c888332-cb40-41f2-a59e-2fe3ef0a577a 2025-05-07T20:23:12.9827654Z MACHINE_NAME=x86_64 2025-05-07T20:23:12.9827874Z _=/usr/bin/printenv 2025-05-07T20:23:12.9828004Z 2025-05-07T20:23:12.9828117Z ################################################################################ 2025-05-07T20:23:12.9828433Z [INFO] Print ldd version ... 2025-05-07T20:23:12.9828696Z + ldd --version 2025-05-07T20:23:12.9828823Z 2025-05-07T20:23:12.9828906Z ldd (GNU libc) 2.34 2025-05-07T20:23:12.9829186Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:12.9829697Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:12.9830221Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:12.9830658Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:12.9830905Z 2025-05-07T20:23:12.9831037Z ################################################################################ 2025-05-07T20:23:12.9831338Z [INFO] Print CPU info ... 
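Each command in these diagnostics is echoed as "+ <command>" before its output ("+ printenv" and "+ ldd --version" above; "+ nproc" and "+ lscpu" below). A plausible print-and-run helper that would produce this trace, sketched as an assumption rather than copied from setup_env.bash:

# Hypothetical echo-then-execute helper (assumed, not the real one).
print_exec () {
  echo "+ $*"
  echo ""
  "$@"
  local retcode=$?
  echo ""
  return "$retcode"
}

Calling print_exec nproc, for example, would yield the "+ nproc" line followed by its output, as seen below.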
2025-05-07T20:23:12.9831560Z + nproc 2025-05-07T20:23:12.9831673Z 2025-05-07T20:23:12.9845940Z 16 2025-05-07T20:23:12.9847697Z 2025-05-07T20:23:12.9848037Z + lscpu 2025-05-07T20:23:12.9848195Z 2025-05-07T20:23:12.9959186Z Architecture: x86_64 2025-05-07T20:23:12.9959681Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:12.9960499Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9961021Z Byte Order: Little Endian 2025-05-07T20:23:12.9961458Z CPU(s): 16 2025-05-07T20:23:12.9961843Z On-line CPU(s) list: 0-15 2025-05-07T20:23:12.9962218Z Vendor ID: AuthenticAMD 2025-05-07T20:23:12.9962548Z Model name: AMD EPYC 7R32 2025-05-07T20:23:12.9962851Z CPU family: 23 2025-05-07T20:23:12.9963452Z Model: 49 2025-05-07T20:23:12.9963741Z Thread(s) per core: 2 2025-05-07T20:23:12.9964016Z Core(s) per socket: 8 2025-05-07T20:23:12.9964292Z Socket(s): 1 2025-05-07T20:23:12.9964560Z Stepping: 0 2025-05-07T20:23:12.9964852Z BogoMIPS: 5600.00 2025-05-07T20:23:12.9966917Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9969073Z Hypervisor vendor: KVM 2025-05-07T20:23:12.9969368Z Virtualization type: full 2025-05-07T20:23:12.9969699Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9970054Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:12.9970396Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:12.9970739Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:12.9971076Z NUMA node(s): 1 2025-05-07T20:23:12.9971385Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:12.9971710Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:12.9972114Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:12.9972639Z Vulnerability L1tf: Not affected 2025-05-07T20:23:12.9973116Z Vulnerability Mds: Not affected 2025-05-07T20:23:12.9973599Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:12.9974085Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:12.9974579Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:12.9975308Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:12.9976098Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:12.9976708Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:12.9977561Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:12.9978413Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:12.9979083Z Vulnerability Srbds: Not affected 2025-05-07T20:23:12.9979444Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:12.9979667Z 2025-05-07T20:23:12.9979837Z + cat /proc/cpuinfo 2025-05-07T20:23:12.9979989Z 2025-05-07T20:23:12.9980074Z processor : 0 2025-05-07T20:23:12.9980303Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9980555Z cpu family : 23 2025-05-07T20:23:12.9980774Z model : 49 
2025-05-07T20:23:12.9980996Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.9981246Z stepping : 0 2025-05-07T20:23:12.9981471Z microcode : 0x830107f 2025-05-07T20:23:12.9981802Z cpu MHz : 3333.146 2025-05-07T20:23:12.9982013Z cache size : 512 KB 2025-05-07T20:23:12.9982216Z physical id : 0 2025-05-07T20:23:12.9982417Z siblings : 16 2025-05-07T20:23:12.9982614Z core id : 0 2025-05-07T20:23:12.9982802Z cpu cores : 8 2025-05-07T20:23:12.9982994Z apicid : 0 2025-05-07T20:23:12.9983183Z initial apicid : 0 2025-05-07T20:23:12.9983383Z fpu : yes 2025-05-07T20:23:12.9983572Z fpu_exception : yes 2025-05-07T20:23:12.9983780Z cpuid level : 13 2025-05-07T20:23:12.9983973Z wp : yes 2025-05-07T20:23:12.9986003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9988214Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.9988696Z bogomips : 5600.00 2025-05-07T20:23:12.9988905Z TLB size : 3072 4K pages 2025-05-07T20:23:12.9989132Z clflush size : 64 2025-05-07T20:23:12.9989338Z cache_alignment : 64 2025-05-07T20:23:12.9989666Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:12.9989986Z power management: 2025-05-07T20:23:12.9990127Z 2025-05-07T20:23:12.9990205Z processor : 1 2025-05-07T20:23:12.9990417Z vendor_id : AuthenticAMD 2025-05-07T20:23:12.9990661Z cpu family : 23 2025-05-07T20:23:12.9990895Z model : 49 2025-05-07T20:23:12.9991093Z model name : AMD EPYC 7R32 2025-05-07T20:23:12.9991324Z stepping : 0 2025-05-07T20:23:12.9991530Z microcode : 0x830107f 2025-05-07T20:23:12.9991748Z cpu MHz : 3305.353 2025-05-07T20:23:12.9991951Z cache size : 512 KB 2025-05-07T20:23:12.9992169Z physical id : 0 2025-05-07T20:23:12.9992378Z siblings : 16 2025-05-07T20:23:12.9992566Z core id : 1 2025-05-07T20:23:12.9992762Z cpu cores : 8 2025-05-07T20:23:12.9992956Z apicid : 2 2025-05-07T20:23:12.9993139Z initial apicid : 2 2025-05-07T20:23:12.9993350Z fpu : yes 2025-05-07T20:23:12.9993541Z fpu_exception : yes 2025-05-07T20:23:12.9993749Z cpuid level : 13 2025-05-07T20:23:12.9993958Z wp : yes 2025-05-07T20:23:12.9995915Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:12.9998157Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:12.9998665Z bogomips : 5600.00 2025-05-07T20:23:12.9998929Z TLB size : 3072 4K pages 2025-05-07T20:23:12.9999216Z clflush size : 64 
2025-05-07T20:23:12.9999446Z cache_alignment : 64 2025-05-07T20:23:12.9999701Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0000011Z power management: 2025-05-07T20:23:13.0000140Z 2025-05-07T20:23:13.0000235Z processor : 2 2025-05-07T20:23:13.0000438Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0000667Z cpu family : 23 2025-05-07T20:23:13.0000868Z model : 49 2025-05-07T20:23:13.0001086Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0001350Z stepping : 0 2025-05-07T20:23:13.0001555Z microcode : 0x830107f 2025-05-07T20:23:13.0001769Z cpu MHz : 3284.425 2025-05-07T20:23:13.0001973Z cache size : 512 KB 2025-05-07T20:23:13.0002179Z physical id : 0 2025-05-07T20:23:13.0002373Z siblings : 16 2025-05-07T20:23:13.0002668Z core id : 2 2025-05-07T20:23:13.0002858Z cpu cores : 8 2025-05-07T20:23:13.0003049Z apicid : 4 2025-05-07T20:23:13.0003238Z initial apicid : 4 2025-05-07T20:23:13.0003443Z fpu : yes 2025-05-07T20:23:13.0003629Z fpu_exception : yes 2025-05-07T20:23:13.0004211Z cpuid level : 13 2025-05-07T20:23:13.0004415Z wp : yes 2025-05-07T20:23:13.0006509Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0008740Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0009221Z bogomips : 5600.00 2025-05-07T20:23:13.0009436Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0009672Z clflush size : 64 2025-05-07T20:23:13.0009878Z cache_alignment : 64 2025-05-07T20:23:13.0010143Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0010455Z power management: 2025-05-07T20:23:13.0010585Z 2025-05-07T20:23:13.0010668Z processor : 3 2025-05-07T20:23:13.0010915Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0011173Z cpu family : 23 2025-05-07T20:23:13.0011362Z model : 49 2025-05-07T20:23:13.0011562Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0011797Z stepping : 0 2025-05-07T20:23:13.0012004Z microcode : 0x830107f 2025-05-07T20:23:13.0012217Z cpu MHz : 3301.887 2025-05-07T20:23:13.0012434Z cache size : 512 KB 2025-05-07T20:23:13.0012646Z physical id : 0 2025-05-07T20:23:13.0012842Z siblings : 16 2025-05-07T20:23:13.0013031Z core id : 3 2025-05-07T20:23:13.0013224Z cpu cores : 8 2025-05-07T20:23:13.0013411Z apicid : 6 2025-05-07T20:23:13.0013598Z initial apicid : 6 2025-05-07T20:23:13.0013802Z fpu : yes 2025-05-07T20:23:13.0013986Z fpu_exception : yes 2025-05-07T20:23:13.0014200Z cpuid level : 13 2025-05-07T20:23:13.0014400Z wp : yes 2025-05-07T20:23:13.0016336Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0018560Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0019037Z bogomips : 5600.00 2025-05-07T20:23:13.0019250Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0019475Z clflush size : 64 2025-05-07T20:23:13.0019682Z cache_alignment : 64 2025-05-07T20:23:13.0019945Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0020248Z power management: 2025-05-07T20:23:13.0020376Z 2025-05-07T20:23:13.0020457Z processor : 4 2025-05-07T20:23:13.0020678Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0020942Z cpu family : 23 2025-05-07T20:23:13.0021138Z model : 49 2025-05-07T20:23:13.0021338Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0021574Z stepping : 0 2025-05-07T20:23:13.0021777Z microcode : 0x830107f 2025-05-07T20:23:13.0021986Z cpu MHz : 3280.572 2025-05-07T20:23:13.0022193Z cache size : 512 KB 2025-05-07T20:23:13.0022403Z physical id : 0 2025-05-07T20:23:13.0022601Z siblings : 16 2025-05-07T20:23:13.0022792Z core id : 4 2025-05-07T20:23:13.0022981Z cpu cores : 8 2025-05-07T20:23:13.0023189Z apicid : 8 2025-05-07T20:23:13.0023501Z initial apicid : 8 2025-05-07T20:23:13.0033522Z fpu : yes 2025-05-07T20:23:13.0033799Z fpu_exception : yes 2025-05-07T20:23:13.0034022Z cpuid level : 13 2025-05-07T20:23:13.0034234Z wp : yes 2025-05-07T20:23:13.0036310Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0038572Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0039053Z bogomips : 5600.00 2025-05-07T20:23:13.0039283Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0039530Z clflush size : 64 2025-05-07T20:23:13.0039743Z cache_alignment : 64 2025-05-07T20:23:13.0040018Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0040337Z power management: 2025-05-07T20:23:13.0040472Z 2025-05-07T20:23:13.0040564Z processor : 5 2025-05-07T20:23:13.0040774Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0041017Z cpu family : 23 2025-05-07T20:23:13.0041231Z model : 49 2025-05-07T20:23:13.0041435Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0041685Z stepping : 0 2025-05-07T20:23:13.0041899Z microcode : 0x830107f 2025-05-07T20:23:13.0042120Z cpu MHz : 3318.665 2025-05-07T20:23:13.0042339Z cache size : 512 KB 2025-05-07T20:23:13.0042557Z physical id : 0 2025-05-07T20:23:13.0042758Z siblings : 16 2025-05-07T20:23:13.0042960Z core id : 5 2025-05-07T20:23:13.0043159Z cpu cores : 8 2025-05-07T20:23:13.0043352Z apicid : 10 2025-05-07T20:23:13.0043561Z initial apicid : 10 2025-05-07T20:23:13.0043774Z fpu : yes 2025-05-07T20:23:13.0043972Z fpu_exception : yes 2025-05-07T20:23:13.0044192Z cpuid level : 13 2025-05-07T20:23:13.0044401Z wp : yes 2025-05-07T20:23:13.0046336Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0048553Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0049047Z bogomips : 5600.00 2025-05-07T20:23:13.0049277Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0049517Z clflush size : 64 2025-05-07T20:23:13.0049729Z cache_alignment : 64 2025-05-07T20:23:13.0049999Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0050316Z power management: 2025-05-07T20:23:13.0050445Z 2025-05-07T20:23:13.0050528Z processor : 6 2025-05-07T20:23:13.0050738Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0050967Z cpu family : 23 2025-05-07T20:23:13.0051160Z model : 49 2025-05-07T20:23:13.0051365Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0051599Z stepping : 0 2025-05-07T20:23:13.0051798Z microcode : 0x830107f 2025-05-07T20:23:13.0052024Z cpu MHz : 3304.334 2025-05-07T20:23:13.0052231Z cache size : 512 KB 2025-05-07T20:23:13.0052437Z physical id : 0 2025-05-07T20:23:13.0052636Z siblings : 16 2025-05-07T20:23:13.0052830Z core id : 6 2025-05-07T20:23:13.0053019Z cpu cores : 8 2025-05-07T20:23:13.0053214Z apicid : 12 2025-05-07T20:23:13.0053415Z initial apicid : 12 2025-05-07T20:23:13.0053618Z fpu : yes 2025-05-07T20:23:13.0053813Z fpu_exception : yes 2025-05-07T20:23:13.0054027Z cpuid level : 13 2025-05-07T20:23:13.0054313Z wp : yes 2025-05-07T20:23:13.0056355Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0058587Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0059070Z bogomips : 5600.00 2025-05-07T20:23:13.0059292Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0059514Z clflush size : 64 2025-05-07T20:23:13.0059727Z cache_alignment : 64 2025-05-07T20:23:13.0059994Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0060305Z power management: 2025-05-07T20:23:13.0060444Z 2025-05-07T20:23:13.0060525Z processor : 7 2025-05-07T20:23:13.0060762Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0061014Z cpu family : 23 2025-05-07T20:23:13.0061219Z model : 49 2025-05-07T20:23:13.0061430Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0061666Z stepping : 0 2025-05-07T20:23:13.0061867Z microcode : 0x830107f 2025-05-07T20:23:13.0062085Z cpu MHz : 3297.610 2025-05-07T20:23:13.0062303Z cache size : 512 KB 2025-05-07T20:23:13.0062511Z physical id : 0 2025-05-07T20:23:13.0062718Z siblings : 16 2025-05-07T20:23:13.0062916Z core id : 7 2025-05-07T20:23:13.0063106Z cpu cores : 8 2025-05-07T20:23:13.0063306Z apicid : 
14 2025-05-07T20:23:13.0063508Z initial apicid : 14 2025-05-07T20:23:13.0063709Z fpu : yes 2025-05-07T20:23:13.0063901Z fpu_exception : yes 2025-05-07T20:23:13.0064112Z cpuid level : 13 2025-05-07T20:23:13.0064304Z wp : yes 2025-05-07T20:23:13.0066252Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0068482Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0068971Z bogomips : 5600.00 2025-05-07T20:23:13.0069180Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0069409Z clflush size : 64 2025-05-07T20:23:13.0069728Z cache_alignment : 64 2025-05-07T20:23:13.0069987Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0070296Z power management: 2025-05-07T20:23:13.0070434Z 2025-05-07T20:23:13.0070512Z processor : 8 2025-05-07T20:23:13.0070722Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0070950Z cpu family : 23 2025-05-07T20:23:13.0071180Z model : 49 2025-05-07T20:23:13.0071409Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0071642Z stepping : 0 2025-05-07T20:23:13.0071844Z microcode : 0x830107f 2025-05-07T20:23:13.0072065Z cpu MHz : 3302.130 2025-05-07T20:23:13.0072267Z cache size : 512 KB 2025-05-07T20:23:13.0072488Z physical id : 0 2025-05-07T20:23:13.0072706Z siblings : 16 2025-05-07T20:23:13.0072905Z core id : 0 2025-05-07T20:23:13.0073100Z cpu cores : 8 2025-05-07T20:23:13.0073302Z apicid : 1 2025-05-07T20:23:13.0073491Z initial apicid : 1 2025-05-07T20:23:13.0073704Z fpu : yes 2025-05-07T20:23:13.0073890Z fpu_exception : yes 2025-05-07T20:23:13.0074094Z cpuid level : 13 2025-05-07T20:23:13.0074292Z wp : yes 2025-05-07T20:23:13.0076229Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0078748Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0079231Z bogomips : 5600.00 2025-05-07T20:23:13.0079440Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0079672Z clflush size : 64 2025-05-07T20:23:13.0079886Z cache_alignment : 64 2025-05-07T20:23:13.0080140Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0080452Z power management: 2025-05-07T20:23:13.0080578Z 2025-05-07T20:23:13.0080659Z processor : 9 2025-05-07T20:23:13.0080863Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0081090Z cpu family : 23 2025-05-07T20:23:13.0081283Z model : 49 2025-05-07T20:23:13.0081471Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0081703Z 
stepping : 0 2025-05-07T20:23:13.0081901Z microcode : 0x830107f 2025-05-07T20:23:13.0082120Z cpu MHz : 3286.636 2025-05-07T20:23:13.0082316Z cache size : 512 KB 2025-05-07T20:23:13.0082517Z physical id : 0 2025-05-07T20:23:13.0082718Z siblings : 16 2025-05-07T20:23:13.0082904Z core id : 1 2025-05-07T20:23:13.0083101Z cpu cores : 8 2025-05-07T20:23:13.0083293Z apicid : 3 2025-05-07T20:23:13.0083475Z initial apicid : 3 2025-05-07T20:23:13.0083677Z fpu : yes 2025-05-07T20:23:13.0083870Z fpu_exception : yes 2025-05-07T20:23:13.0084078Z cpuid level : 13 2025-05-07T20:23:13.0084279Z wp : yes 2025-05-07T20:23:13.0086219Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0088458Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0088926Z bogomips : 5600.00 2025-05-07T20:23:13.0089139Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0089364Z clflush size : 64 2025-05-07T20:23:13.0089567Z cache_alignment : 64 2025-05-07T20:23:13.0089879Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0090286Z power management: 2025-05-07T20:23:13.0090413Z 2025-05-07T20:23:13.0090499Z processor : 10 2025-05-07T20:23:13.0090729Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0090992Z cpu family : 23 2025-05-07T20:23:13.0091184Z model : 49 2025-05-07T20:23:13.0091379Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0091614Z stepping : 0 2025-05-07T20:23:13.0091811Z microcode : 0x830107f 2025-05-07T20:23:13.0092024Z cpu MHz : 3275.585 2025-05-07T20:23:13.0092230Z cache size : 512 KB 2025-05-07T20:23:13.0092429Z physical id : 0 2025-05-07T20:23:13.0092625Z siblings : 16 2025-05-07T20:23:13.0092814Z core id : 2 2025-05-07T20:23:13.0092999Z cpu cores : 8 2025-05-07T20:23:13.0093184Z apicid : 5 2025-05-07T20:23:13.0093379Z initial apicid : 5 2025-05-07T20:23:13.0093584Z fpu : yes 2025-05-07T20:23:13.0093763Z fpu_exception : yes 2025-05-07T20:23:13.0093966Z cpuid level : 13 2025-05-07T20:23:13.0094162Z wp : yes 2025-05-07T20:23:13.0096142Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0098451Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0099107Z bogomips : 5600.00 2025-05-07T20:23:13.0099513Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0099831Z clflush size : 64 2025-05-07T20:23:13.0100051Z cache_alignment : 64 2025-05-07T20:23:13.0100317Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:13.0100653Z power management: 2025-05-07T20:23:13.0100859Z 2025-05-07T20:23:13.0100996Z processor : 11 2025-05-07T20:23:13.0101290Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0101624Z cpu family : 23 2025-05-07T20:23:13.0101836Z model : 49 2025-05-07T20:23:13.0102053Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0102287Z stepping : 0 2025-05-07T20:23:13.0102483Z microcode : 0x830107f 2025-05-07T20:23:13.0102700Z cpu MHz : 3269.732 2025-05-07T20:23:13.0102908Z cache size : 512 KB 2025-05-07T20:23:13.0103177Z physical id : 0 2025-05-07T20:23:13.0103456Z siblings : 16 2025-05-07T20:23:13.0103935Z core id : 3 2025-05-07T20:23:13.0104193Z cpu cores : 8 2025-05-07T20:23:13.0104449Z apicid : 7 2025-05-07T20:23:13.0104713Z initial apicid : 7 2025-05-07T20:23:13.0104996Z fpu : yes 2025-05-07T20:23:13.0105252Z fpu_exception : yes 2025-05-07T20:23:13.0105539Z cpuid level : 13 2025-05-07T20:23:13.0105754Z wp : yes 2025-05-07T20:23:13.0107695Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0109999Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0110477Z bogomips : 5600.00 2025-05-07T20:23:13.0110693Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0110921Z clflush size : 64 2025-05-07T20:23:13.0111133Z cache_alignment : 64 2025-05-07T20:23:13.0111400Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0111705Z power management: 2025-05-07T20:23:13.0111841Z 2025-05-07T20:23:13.0111922Z processor : 12 2025-05-07T20:23:13.0112131Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0112358Z cpu family : 23 2025-05-07T20:23:13.0112554Z model : 49 2025-05-07T20:23:13.0112752Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0112991Z stepping : 0 2025-05-07T20:23:13.0113184Z microcode : 0x830107f 2025-05-07T20:23:13.0113401Z cpu MHz : 3263.783 2025-05-07T20:23:13.0113612Z cache size : 512 KB 2025-05-07T20:23:13.0113815Z physical id : 0 2025-05-07T20:23:13.0114042Z siblings : 16 2025-05-07T20:23:13.0114314Z core id : 4 2025-05-07T20:23:13.0114575Z cpu cores : 8 2025-05-07T20:23:13.0114840Z apicid : 9 2025-05-07T20:23:13.0115102Z initial apicid : 9 2025-05-07T20:23:13.0115389Z fpu : yes 2025-05-07T20:23:13.0115665Z fpu_exception : yes 2025-05-07T20:23:13.0115964Z cpuid level : 13 2025-05-07T20:23:13.0116237Z wp : yes 2025-05-07T20:23:13.0118712Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:13.0121133Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0121609Z bogomips : 5600.00 2025-05-07T20:23:13.0121816Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0122039Z clflush size : 64 2025-05-07T20:23:13.0122246Z cache_alignment : 64 2025-05-07T20:23:13.0122634Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0122933Z power management: 2025-05-07T20:23:13.0123064Z 2025-05-07T20:23:13.0123143Z processor : 13 2025-05-07T20:23:13.0123347Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0123565Z cpu family : 23 2025-05-07T20:23:13.0123754Z model : 49 2025-05-07T20:23:13.0123950Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0124173Z stepping : 0 2025-05-07T20:23:13.0124366Z microcode : 0x830107f 2025-05-07T20:23:13.0124586Z cpu MHz : 3300.457 2025-05-07T20:23:13.0124785Z cache size : 512 KB 2025-05-07T20:23:13.0124992Z physical id : 0 2025-05-07T20:23:13.0125191Z siblings : 16 2025-05-07T20:23:13.0125374Z core id : 5 2025-05-07T20:23:13.0125564Z cpu cores : 8 2025-05-07T20:23:13.0125752Z apicid : 11 2025-05-07T20:23:13.0126010Z initial apicid : 11 2025-05-07T20:23:13.0126298Z fpu : yes 2025-05-07T20:23:13.0126554Z fpu_exception : yes 2025-05-07T20:23:13.0126833Z cpuid level : 13 2025-05-07T20:23:13.0127116Z wp : yes 2025-05-07T20:23:13.0129820Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0132040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0132509Z bogomips : 5600.00 2025-05-07T20:23:13.0132713Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0132936Z clflush size : 64 2025-05-07T20:23:13.0133140Z cache_alignment : 64 2025-05-07T20:23:13.0133394Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0133712Z power management: 2025-05-07T20:23:13.0133843Z 2025-05-07T20:23:13.0133926Z processor : 14 2025-05-07T20:23:13.0134125Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0134347Z cpu family : 23 2025-05-07T20:23:13.0134539Z model : 49 2025-05-07T20:23:13.0134725Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0134952Z stepping : 0 2025-05-07T20:23:13.0135152Z microcode : 0x830107f 2025-05-07T20:23:13.0135358Z cpu MHz : 3285.956 2025-05-07T20:23:13.0135561Z cache size : 512 KB 2025-05-07T20:23:13.0135768Z physical id : 0 2025-05-07T20:23:13.0135956Z siblings : 16 2025-05-07T20:23:13.0136144Z core id : 6 2025-05-07T20:23:13.0136333Z cpu cores : 8 2025-05-07T20:23:13.0136518Z apicid : 13 2025-05-07T20:23:13.0136710Z initial apicid : 13 2025-05-07T20:23:13.0136910Z fpu : yes 2025-05-07T20:23:13.0137093Z fpu_exception : yes 2025-05-07T20:23:13.0137296Z cpuid level : 13 2025-05-07T20:23:13.0137491Z wp : yes 2025-05-07T20:23:13.0139802Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0143192Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0143660Z bogomips : 5600.00 2025-05-07T20:23:13.0143871Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0144092Z clflush size : 64 2025-05-07T20:23:13.0144294Z cache_alignment : 64 2025-05-07T20:23:13.0144552Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0144858Z power management: 2025-05-07T20:23:13.0144984Z 2025-05-07T20:23:13.0145158Z processor : 15 2025-05-07T20:23:13.0145367Z vendor_id : AuthenticAMD 2025-05-07T20:23:13.0145593Z cpu family : 23 2025-05-07T20:23:13.0145784Z model : 49 2025-05-07T20:23:13.0145981Z model name : AMD EPYC 7R32 2025-05-07T20:23:13.0146208Z stepping : 0 2025-05-07T20:23:13.0146406Z microcode : 0x830107f 2025-05-07T20:23:13.0146617Z cpu MHz : 3291.566 2025-05-07T20:23:13.0146827Z cache size : 512 KB 2025-05-07T20:23:13.0147028Z physical id : 0 2025-05-07T20:23:13.0147236Z siblings : 16 2025-05-07T20:23:13.0147431Z core id : 7 2025-05-07T20:23:13.0147614Z cpu cores : 8 2025-05-07T20:23:13.0147805Z apicid : 15 2025-05-07T20:23:13.0147997Z initial apicid : 15 2025-05-07T20:23:13.0148196Z fpu : yes 2025-05-07T20:23:13.0148383Z fpu_exception : yes 2025-05-07T20:23:13.0148594Z cpuid level : 13 2025-05-07T20:23:13.0148787Z wp : yes 2025-05-07T20:23:13.0150832Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:13.0153665Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:13.0154337Z bogomips : 5600.00 2025-05-07T20:23:13.0154626Z TLB size : 3072 4K pages 2025-05-07T20:23:13.0154856Z clflush size : 64 2025-05-07T20:23:13.0155061Z cache_alignment : 64 2025-05-07T20:23:13.0155314Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:13.0155620Z power management: 2025-05-07T20:23:13.0155751Z 2025-05-07T20:23:13.0155756Z 2025-05-07T20:23:13.0155871Z ################################################################################ 2025-05-07T20:23:13.0156209Z [INFO] Print PCI info ... 2025-05-07T20:23:13.0156458Z + lspci -v 2025-05-07T20:23:13.0156583Z 2025-05-07T20:23:13.0156790Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:13.0157162Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:13.0157469Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:13.0157672Z 2025-05-07T20:23:13.0157871Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:13.0158236Z Physical Slot: 1 2025-05-07T20:23:13.0158469Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0158665Z 2025-05-07T20:23:13.0158912Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:13.0159327Z Physical Slot: 1 2025-05-07T20:23:13.0159575Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:13.0159791Z 2025-05-07T20:23:13.0160054Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:13.0160484Z Physical Slot: 3 2025-05-07T20:23:13.0160714Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0161042Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:13.0161388Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:13.0161605Z 2025-05-07T20:23:13.0161897Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:13.0162503Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:13.0162778Z Physical Slot: 4 2025-05-07T20:23:13.0163027Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:13.0163398Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0163742Z Capabilities: 2025-05-07T20:23:13.0164003Z Kernel driver in use: nvme 2025-05-07T20:23:13.0164160Z 2025-05-07T20:23:13.0164450Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:13.0164924Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:13.0165259Z Physical Slot: 5 2025-05-07T20:23:13.0165493Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0165845Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0166217Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:13.0166526Z Capabilities: 2025-05-07T20:23:13.0166789Z Kernel driver in use: ena 2025-05-07T20:23:13.0167029Z Kernel modules: ena 2025-05-07T20:23:13.0167165Z 2025-05-07T20:23:13.0167334Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:13.0167696Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:13.0167982Z Physical Slot: 30 2025-05-07T20:23:13.0168232Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:13.0168594Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:13.0168975Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:13.0169466Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:13.0169907Z Capabilities: 2025-05-07T20:23:13.0170264Z Kernel driver in use: nvidia 2025-05-07T20:23:13.0170585Z Kernel modules: nvidia 2025-05-07T20:23:13.0170730Z 2025-05-07T20:23:13.0171028Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:13.0171526Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:13.0171805Z Physical Slot: 31 2025-05-07T20:23:13.0172043Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:13.0172380Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:13.0172754Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:13.0173068Z Capabilities: 2025-05-07T20:23:13.0173319Z Kernel driver in use: nvme 2025-05-07T20:23:13.0173480Z 2025-05-07T20:23:13.0173485Z 2025-05-07T20:23:13.0173599Z ################################################################################ 2025-05-07T20:23:13.0173909Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:13.0174187Z + uname -a 2025-05-07T20:23:13.0174294Z 2025-05-07T20:23:13.0174689Z Linux ip-10-0-73-154.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:13.0175178Z 2025-05-07T20:23:13.0175254Z + uname -m 2025-05-07T20:23:13.0175364Z 2025-05-07T20:23:13.0175439Z x86_64 2025-05-07T20:23:13.0175542Z 2025-05-07T20:23:13.0175627Z + cat /proc/version 2025-05-07T20:23:13.0175765Z 2025-05-07T20:23:13.0176291Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:13.0176908Z 2025-05-07T20:23:13.0176993Z + cat /etc/os-release 2025-05-07T20:23:13.0177131Z 2025-05-07T20:23:13.0177224Z NAME="Amazon Linux" 2025-05-07T20:23:13.0177428Z VERSION="2023" 2025-05-07T20:23:13.0177630Z ID="amzn" 2025-05-07T20:23:13.0177818Z ID_LIKE="fedora" 2025-05-07T20:23:13.0178015Z VERSION_ID="2023" 2025-05-07T20:23:13.0178242Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:13.0178519Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:13.0186147Z ANSI_COLOR="0;33" 2025-05-07T20:23:13.0186434Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:13.0186947Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:13.0187378Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:13.0187792Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:13.0188225Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:13.0188592Z VENDOR_NAME="AWS" 2025-05-07T20:23:13.0188834Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:13.0189116Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:13.0189274Z 2025-05-07T20:23:13.0189502Z ################################################################################ 2025-05-07T20:23:13.0189908Z # Print EC2 Instance Info 2025-05-07T20:23:13.0190141Z # 2025-05-07T20:23:13.0190354Z # [2025-05-07T20:23:13.016Z] + print_ec2_info 2025-05-07T20:23:13.0190667Z ################################################################################ 2025-05-07T20:23:13.0190875Z 2025-05-07T20:23:13.0289197Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:13.0412854Z instance-id: i-00cb9561c833cfdb2 2025-05-07T20:23:13.0527099Z instance-type: g5.4xlarge 2025-05-07T20:23:13.0572018Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:13.0572373Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:13.0581803Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:13.0582154Z env: 2025-05-07T20:23:13.0582373Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:13.0582677Z BUILD_ENV: build_binary 2025-05-07T20:23:13.0582926Z BUILD_TARGET: genai 2025-05-07T20:23:13.0583149Z BUILD_VARIANT: cuda 2025-05-07T20:23:13.0583391Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:13.0583644Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:13.0583945Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:13.0584275Z ##[endgroup] 2025-05-07T20:23:13.3930697Z ################################################################################ 2025-05-07T20:23:13.3931248Z [INFO] Printing general display info ... 2025-05-07T20:23:13.3961500Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:13.5084727Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:13.5095730Z /usr/bin/sudo 2025-05-07T20:23:13.5106617Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:13.5116843Z /usr/bin/yum 2025-05-07T20:23:13.5118627Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:13.5139399Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:13.9555190Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:14.0307718Z ================================================================================ 2025-05-07T20:23:14.0308078Z WARNING: 2025-05-07T20:23:14.0308317Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:14.0308548Z 2025-05-07T20:23:14.0308645Z Available Versions: 2025-05-07T20:23:14.0308787Z 2025-05-07T20:23:14.0308873Z Version 2023.7.20250331: 2025-05-07T20:23:14.0309180Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:14.0309467Z 2025-05-07T20:23:14.0309658Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:14.0309866Z 2025-05-07T20:23:14.0309953Z Release notes: 2025-05-07T20:23:14.0310346Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:14.0310718Z 2025-05-07T20:23:14.0310803Z Version 2023.7.20250414: 2025-05-07T20:23:14.0311105Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:14.0311380Z 2025-05-07T20:23:14.0311506Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:14.0311714Z 2025-05-07T20:23:14.0311795Z Release notes: 2025-05-07T20:23:14.0312179Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:14.0312539Z 2025-05-07T20:23:14.0312630Z Version 2023.7.20250428: 2025-05-07T20:23:14.0312923Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:14.0313168Z 2025-05-07T20:23:14.0313520Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:14.0313728Z 2025-05-07T20:23:14.0313818Z Release notes: 2025-05-07T20:23:14.0314204Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:14.0314561Z 2025-05-07T20:23:14.0314678Z ================================================================================ 2025-05-07T20:23:14.1476090Z Dependencies resolved. 
2025-05-07T20:23:14.1762502Z ================================================================================ 2025-05-07T20:23:14.1762905Z Package Arch Version Repository Size 2025-05-07T20:23:14.1763266Z ================================================================================ 2025-05-07T20:23:14.1763559Z Upgrading: 2025-05-07T20:23:14.1763913Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:14.1764481Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:14.1764842Z 2025-05-07T20:23:14.1765134Z Transaction Summary 2025-05-07T20:23:14.1765382Z ================================================================================ 2025-05-07T20:23:14.1765680Z Upgrade 2 Packages 2025-05-07T20:23:14.1765814Z 2025-05-07T20:23:14.1765933Z Total download size: 6.9 M 2025-05-07T20:23:14.1767602Z Downloading Packages: 2025-05-07T20:23:14.2318707Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 23 MB/s | 1.2 MB 00:00 2025-05-07T20:23:14.2668403Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 64 MB/s | 5.7 MB 00:00 2025-05-07T20:23:14.2676025Z -------------------------------------------------------------------------------- 2025-05-07T20:23:14.2679271Z Total 76 MB/s | 6.9 MB 00:00 2025-05-07T20:23:14.2681639Z Running transaction check 2025-05-07T20:23:14.2779918Z Transaction check succeeded. 2025-05-07T20:23:14.2780516Z Running transaction test 2025-05-07T20:23:14.3077415Z Transaction test succeeded. 2025-05-07T20:23:14.3080007Z Running transaction 2025-05-07T20:23:14.8608921Z Preparing : 1/1 2025-05-07T20:23:14.9666551Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:14.9693638Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.9895919Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:14.9897295Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:15.0008014Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:15.0035975Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:15.1497769Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:15.1498930Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:15.1500031Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:15.1501057Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:15.2903244Z ================================================================================ 2025-05-07T20:23:15.2903634Z WARNING: 2025-05-07T20:23:15.2904088Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:15.2904316Z 2025-05-07T20:23:15.2904423Z Available Versions: 2025-05-07T20:23:15.2904569Z 2025-05-07T20:23:15.2904663Z Version 2023.7.20250331: 2025-05-07T20:23:15.2904968Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:15.2905223Z 2025-05-07T20:23:15.2905345Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:15.2905551Z 2025-05-07T20:23:15.2905646Z Release notes: 2025-05-07T20:23:15.2906059Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:15.2906689Z 2025-05-07T20:23:15.2906794Z Version 2023.7.20250414: 2025-05-07T20:23:15.2907101Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:15.2907348Z 2025-05-07T20:23:15.2907471Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:15.2907677Z 2025-05-07T20:23:15.2907759Z Release notes: 2025-05-07T20:23:15.2908158Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:15.2908523Z 2025-05-07T20:23:15.2908619Z Version 2023.7.20250428: 2025-05-07T20:23:15.2908920Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:15.2909174Z 2025-05-07T20:23:15.2909285Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:15.2909494Z 2025-05-07T20:23:15.2909644Z Release notes: 2025-05-07T20:23:15.2910030Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:15.2910391Z 2025-05-07T20:23:15.2910703Z ================================================================================ 2025-05-07T20:23:15.3475346Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:15.3476285Z 2025-05-07T20:23:15.3476505Z Upgraded: 2025-05-07T20:23:15.3477471Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:15.3479185Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:15.3480193Z 2025-05-07T20:23:15.3480413Z Complete! 2025-05-07T20:23:15.3939774Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:15.3961505Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:15.8489949Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:05 2025. 2025-05-07T20:23:15.8730526Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:15.9135351Z Dependencies resolved. 
2025-05-07T20:23:15.9312617Z ================================================================================ 2025-05-07T20:23:15.9313067Z Package Architecture Version Repository Size 2025-05-07T20:23:15.9313477Z ================================================================================ 2025-05-07T20:23:15.9313773Z Installing: 2025-05-07T20:23:15.9314058Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:15.9314322Z 2025-05-07T20:23:15.9314414Z Transaction Summary 2025-05-07T20:23:15.9314649Z ================================================================================ 2025-05-07T20:23:15.9314943Z Install 1 Package 2025-05-07T20:23:15.9315078Z 2025-05-07T20:23:15.9315394Z Total download size: 319 k 2025-05-07T20:23:15.9315747Z Installed size: 837 k 2025-05-07T20:23:15.9317540Z Downloading Packages: 2025-05-07T20:23:16.0101664Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.5 MB/s | 319 kB 00:00 2025-05-07T20:23:16.0107608Z -------------------------------------------------------------------------------- 2025-05-07T20:23:16.0110539Z Total 3.9 MB/s | 319 kB 00:00 2025-05-07T20:23:16.0264874Z Running transaction check 2025-05-07T20:23:16.0319291Z Transaction check succeeded. 2025-05-07T20:23:16.0319895Z Running transaction test 2025-05-07T20:23:16.0773978Z Transaction test succeeded. 2025-05-07T20:23:16.0777673Z Running transaction 2025-05-07T20:23:16.1819599Z Preparing : 1/1 2025-05-07T20:23:16.2355667Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.4124448Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.5465345Z ================================================================================ 2025-05-07T20:23:16.5465747Z WARNING: 2025-05-07T20:23:16.5465994Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:16.5466584Z 2025-05-07T20:23:16.5466676Z Available Versions: 2025-05-07T20:23:16.5466851Z 2025-05-07T20:23:16.5466940Z Version 2023.7.20250331: 2025-05-07T20:23:16.5467252Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:16.5467503Z 2025-05-07T20:23:16.5467630Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:16.5467838Z 2025-05-07T20:23:16.5467922Z Release notes: 2025-05-07T20:23:16.5468328Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:16.5468698Z 2025-05-07T20:23:16.5468791Z Version 2023.7.20250414: 2025-05-07T20:23:16.5469088Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:16.5469339Z 2025-05-07T20:23:16.5469450Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:16.5469748Z 2025-05-07T20:23:16.5469831Z Release notes: 2025-05-07T20:23:16.5470221Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:16.5470587Z 2025-05-07T20:23:16.5470841Z Version 2023.7.20250428: 2025-05-07T20:23:16.5471143Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:16.5471387Z 2025-05-07T20:23:16.5471503Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:16.5471705Z 2025-05-07T20:23:16.5471794Z Release notes: 2025-05-07T20:23:16.5472172Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:16.5472535Z 2025-05-07T20:23:16.5472657Z ================================================================================ 2025-05-07T20:23:16.5810054Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:16.5810387Z 2025-05-07T20:23:16.5810472Z Installed: 2025-05-07T20:23:16.5810779Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:16.5811063Z 2025-05-07T20:23:16.5811156Z Complete! 2025-05-07T20:23:16.6285611Z + hostname 2025-05-07T20:23:16.6285797Z 2025-05-07T20:23:16.6299074Z ip-10-0-73-154.ec2.internal 2025-05-07T20:23:16.6300230Z 2025-05-07T20:23:16.6301019Z + sudo lshw -C display 2025-05-07T20:23:16.6301234Z 2025-05-07T20:23:17.0502701Z *-display:0 UNCLAIMED 2025-05-07T20:23:17.0503037Z description: VGA compatible controller 2025-05-07T20:23:17.0503360Z product: Amazon.com, Inc. 2025-05-07T20:23:17.0503636Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:17.0504125Z physical id: 3 2025-05-07T20:23:17.0504354Z bus info: pci@0000:00:03.0 2025-05-07T20:23:17.0504606Z version: 00 2025-05-07T20:23:17.0504815Z width: 32 bits 2025-05-07T20:23:17.0505036Z clock: 33MHz 2025-05-07T20:23:17.0505285Z capabilities: vga_controller bus_master 2025-05-07T20:23:17.0505593Z configuration: latency=0 2025-05-07T20:23:17.0505913Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:17.0506235Z *-display:1 2025-05-07T20:23:17.0506450Z description: 3D controller 2025-05-07T20:23:17.0506766Z product: GA102GL [A10G] 2025-05-07T20:23:17.0507023Z vendor: NVIDIA Corporation 2025-05-07T20:23:17.0507283Z physical id: 1e 2025-05-07T20:23:17.0507515Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:17.0507757Z version: a1 2025-05-07T20:23:17.0507970Z width: 64 bits 2025-05-07T20:23:17.0508187Z clock: 33MHz 2025-05-07T20:23:17.0508468Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:17.0508838Z configuration: driver=nvidia latency=0 2025-05-07T20:23:17.0509457Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:17.0542565Z 2025-05-07T20:23:17.0542975Z ################################################################################ 2025-05-07T20:23:17.0543487Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:17.0675141Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:17.0843593Z Wed May 7 20:23:17 2025 2025-05-07T20:23:17.0843982Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.0844474Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:17.0844953Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.0845441Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:17.0845957Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:17.0846375Z | | | MIG M. | 2025-05-07T20:23:17.0846706Z |=========================================+========================+======================| 2025-05-07T20:23:17.0922261Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:17.0922912Z | 0% 30C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:17.0923280Z | | | N/A | 2025-05-07T20:23:17.0923670Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:17.0924065Z 2025-05-07T20:23:17.0924445Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.0924857Z | Processes: | 2025-05-07T20:23:17.0925283Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:17.0925686Z | ID ID Usage | 2025-05-07T20:23:17.0926021Z |=========================================================================================| 2025-05-07T20:23:17.0926822Z | No running processes found | 2025-05-07T20:23:17.0927282Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:17.2355609Z ################################################################################ 2025-05-07T20:23:17.2355946Z [INFO] Printing AMD GPU info ... 
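Taken together, the NVIDIA query above and the ROCm probes below indicate that print_gpu_info checks both GPU vendors, with ENFORCE_CUDA_DEVICE=1 (set in this job's env) presumably making a missing CUDA device fatal. A rough sketch of that logic, with all structure assumed:

# Hypothetical vendor-detection logic modeled on this log's output;
# the real print_gpu_info in setup_env.bash may differ.
print_gpu_info_sketch () {
  if which nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
  elif [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    echo "[CHECK] nvidia-smi not found, but ENFORCE_CUDA_DEVICE=1" >&2
    return 1
  fi
  for tool in rocminfo rocm-smi; do
    if which "$tool" >/dev/null 2>&1; then
      "$tool"
    else
      echo "[CHECK] $tool not found"
    fi
  done
}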
2025-05-07T20:23:17.2355609Z ################################################################################
2025-05-07T20:23:17.2355946Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:17.2497831Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:17.2498610Z [CHECK] rocminfo not found
2025-05-07T20:23:17.2507602Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:17.2508668Z [CHECK] rocm-smi not found
2025-05-07T20:23:17.2558429Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:17.2558856Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:17.2571936Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:17.2572291Z env:
2025-05-07T20:23:17.2572504Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:17.2572801Z   BUILD_ENV: build_binary
2025-05-07T20:23:17.2573043Z   BUILD_TARGET: genai
2025-05-07T20:23:17.2573261Z   BUILD_VARIANT: cuda
2025-05-07T20:23:17.2573491Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:17.2573740Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:17.2574031Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:17.2574357Z ##[endgroup]
2025-05-07T20:23:17.5919225Z ################################################################################
2025-05-07T20:23:17.5919569Z # Setup Miniconda
2025-05-07T20:23:17.5919781Z #
2025-05-07T20:23:17.5935761Z # [2025-05-07T20:23:17.593Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:17.5936221Z ################################################################################
2025-05-07T20:23:17.5936467Z 
2025-05-07T20:23:17.5952353Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:17.6995409Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:17.6995765Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:17.6995963Z 
2025-05-07T20:23:17.7013501Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:17.7036189Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:18.5697896Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:18.5698619Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:18.5699109Z 
2025-05-07T20:23:18.5841855Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:19.0342314Z Unpacking payload ...
2025-05-07T20:23:19.5535269Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:20.3524196Z entry_point.py:256: DeprecationWarning: (same warning repeated)
2025-05-07T20:23:22.4444672Z 
2025-05-07T20:23:22.4445015Z Installing base environment...
2025-05-07T20:23:22.4445233Z 
2025-05-07T20:23:23.5179612Z Preparing transaction: ...working... done
2025-05-07T20:23:26.5099633Z Executing transaction: ...working... done
2025-05-07T20:23:27.1765996Z entry_point.py:256: DeprecationWarning: (same warning repeated)
2025-05-07T20:23:27.2659127Z installation finished.
2025-05-07T20:23:27.2667877Z 
2025-05-07T20:23:27.2668283Z + rm -f miniconda.sh
2025-05-07T20:23:27.2668525Z 
2025-05-07T20:23:27.2979744Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:27.2980243Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:27.2980559Z 
2025-05-07T20:23:27.6628546Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:27.6628963Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:27.6629431Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:27.6629870Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:27.6630221Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:27.6630614Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:27.6631045Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:27.6631474Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:27.6631923Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:27.6632701Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:27.6633220Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:27.6633582Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:27.6633777Z 
2025-05-07T20:23:27.6633968Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:27.6634258Z 
2025-05-07T20:23:27.7276889Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:27.7277097Z 
2025-05-07T20:23:28.5626790Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:28.5650724Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:23:42.0299860Z Collecting package metadata (current_repodata.json): ...working... done
2025-05-07T20:23:43.5908410Z Solving environment: ...working... done
2025-05-07T20:23:43.6882468Z 
2025-05-07T20:23:43.6882929Z ## Package Plan ##
2025-05-07T20:23:43.6883110Z 
2025-05-07T20:23:43.6883248Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:43.6883499Z 
2025-05-07T20:23:43.6883593Z   added / updated specs:
2025-05-07T20:23:43.6883860Z     - conda-libmamba-solver
2025-05-07T20:23:43.6884101Z     - libarchive
2025-05-07T20:23:43.6884304Z     - libmamba
2025-05-07T20:23:43.6884507Z     - libmambapy
2025-05-07T20:23:43.6884630Z 
2025-05-07T20:23:43.6884775Z The following packages will be downloaded:
2025-05-07T20:23:43.6884998Z 
2025-05-07T20:23:43.6885107Z     package                      |            build
2025-05-07T20:23:43.6885425Z     -----------------------------|-----------------
2025-05-07T20:23:43.6885831Z     ca-certificates-2025.4.26    |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:23:43.6886300Z     certifi-2025.4.26            |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:23:43.6886730Z     conda-25.3.1                 |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:23:43.6887196Z     conda-libmamba-solver-25.4.0 |     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:23:43.6887630Z     ------------------------------------------------------------
2025-05-07T20:23:43.6887965Z                                            Total:         1.4 MB
2025-05-07T20:23:43.6888175Z 
2025-05-07T20:23:43.6888282Z The following packages will be UPDATED:
2025-05-07T20:23:43.6888483Z 
2025-05-07T20:23:43.6893308Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:23:43.6894084Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:23:43.6894472Z 
2025-05-07T20:23:43.6894693Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:23:43.6895013Z 
2025-05-07T20:23:43.6895326Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:23:43.6896119Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:23:43.6896602Z 
2025-05-07T20:23:43.6896754Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: ca-certificates-2025.4.26, certifi-2025.4.26, conda-25.3.1, and conda-libmamba-solver-25.4.0 each downloaded to 100%]
2025-05-07T20:23:43.9173494Z done
2025-05-07T20:23:44.0178057Z Preparing transaction: ...working... done
2025-05-07T20:23:44.1184047Z Verifying transaction: ...working... done
2025-05-07T20:23:45.4203010Z Executing transaction: ...working... done
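[EXAMPLE] The "[EXEC] [ATTEMPT n/3]" lines throughout this job come from a retry helper defined in .github/scripts/setup_env.bash; the helper's source is not shown in this log, so the following is only a hypothetical bash sketch of the pattern those lines suggest:

    # Hypothetical sketch of a bounded-retry runner (the real helper lives in
    # setup_env.bash and may differ in name and behavior).
    exec_with_retries () {
      local max_attempts=3 attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        "$@" && return 0   # stop at the first success
        sleep 2            # brief pause before retrying
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Usage, mirroring the network check seen in this log:
    # exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null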
2025-05-07T20:23:47.1263042Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:23:47.1287718Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:23:48.0658478Z Channels:
2025-05-07T20:23:48.0658867Z  - defaults
2025-05-07T20:23:48.0659230Z Platform: linux-64
2025-05-07T20:23:49.2879242Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:23:49.4049347Z Solving environment: ...working...
2025-05-07T20:23:49.4049347Z Channels:
2025-05-07T20:23:49.4049657Z  - defaults
2025-05-07T20:23:49.4050018Z Platform: linux-64
2025-05-07T20:23:49.6971772Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:23:49.9127057Z Solving environment: ...working... done
2025-05-07T20:23:49.9954399Z done
2025-05-07T20:23:50.0619482Z 
2025-05-07T20:23:50.0619760Z ## Package Plan ##
2025-05-07T20:23:50.0620003Z 
2025-05-07T20:23:50.0620151Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:23:50.0620409Z 
2025-05-07T20:23:50.0620501Z   added / updated specs:
2025-05-07T20:23:50.0620746Z     - conda
2025-05-07T20:23:50.0620858Z 
2025-05-07T20:23:50.0620983Z The following packages will be downloaded:
2025-05-07T20:23:50.0621192Z 
2025-05-07T20:23:50.0621303Z     package                    |            build
2025-05-07T20:23:50.0621612Z     ---------------------------|-----------------
2025-05-07T20:23:50.0621946Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:23:50.0622565Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:23:50.0623000Z     ------------------------------------------------------------
2025-05-07T20:23:50.0623366Z                                            Total:         1.4 MB
2025-05-07T20:23:50.0623576Z 
2025-05-07T20:23:50.0623688Z The following packages will be UPDATED:
2025-05-07T20:23:50.0623892Z 
2025-05-07T20:23:50.0624178Z   pip                pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:23:50.0624672Z   tzdata                                 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:23:50.0624915Z 
2025-05-07T20:23:50.0625065Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: pip-25.1 (1.3 MB) and tzdata-2025b (116 KB) downloaded to 100%]
2025-05-07T20:23:50.3358417Z done
2025-05-07T20:23:50.4361114Z Preparing transaction: ...working... done
2025-05-07T20:23:50.5367065Z Verifying transaction: ...working... done
2025-05-07T20:23:52.5441943Z Executing transaction: ...working... done
2025-05-07T20:23:53.1513493Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:23:53.1517985Z + conda clean --packages --tarball -y
2025-05-07T20:23:53.1518195Z 
2025-05-07T20:23:54.1502565Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:23:54.1503019Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:23:54.2140661Z 
2025-05-07T20:23:54.2148794Z + conda clean --all -y
2025-05-07T20:23:54.2149000Z 
2025-05-07T20:23:54.8089602Z There are no unused tarball(s) to remove.
2025-05-07T20:23:54.8089928Z Will remove 1 index cache(s).
2025-05-07T20:23:54.8090209Z There are no unused package(s) to remove.
2025-05-07T20:23:54.8090505Z There are no tempfile(s) to remove.
2025-05-07T20:23:54.8090794Z There are no logfile(s) to remove.
2025-05-07T20:23:54.8773910Z 
2025-05-07T20:23:54.8778573Z + conda info
2025-05-07T20:23:54.8778734Z 
2025-05-07T20:23:55.6667852Z 
2025-05-07T20:23:55.6668398Z      active environment : base
2025-05-07T20:23:55.6668762Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:23:55.6669092Z             shell level : 1
2025-05-07T20:23:55.6669370Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:23:55.6669858Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:23:55.6670212Z           conda version : 25.3.1
2025-05-07T20:23:55.6670501Z     conda-build version : not installed
2025-05-07T20:23:55.6670805Z          python version : 3.13.2.final.0
2025-05-07T20:23:55.6671101Z                  solver : libmamba (default)
2025-05-07T20:23:55.6671448Z        virtual packages : __archspec=1=zen2
2025-05-07T20:23:55.6671760Z                           __conda=25.3.1=0
2025-05-07T20:23:55.6672032Z                           __cuda=12.8=0
2025-05-07T20:23:55.6672307Z                           __glibc=2.34=0
2025-05-07T20:23:55.6672600Z                           __linux=6.1.130=0
2025-05-07T20:23:55.6672878Z                           __unix=0=0
2025-05-07T20:23:55.6673548Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:23:55.6673961Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:23:55.6674315Z   conda av metadata url : None
2025-05-07T20:23:55.6674674Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:23:55.6675106Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:23:55.6675492Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:23:55.6675872Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:23:55.6676233Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:23:55.6676576Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:23:55.6676916Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:23:55.6677246Z                           /home/ec2-user/.conda/envs
2025-05-07T20:23:55.6677550Z                platform : linux-64
2025-05-07T20:23:55.6678384Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:23:55.6679356Z                 UID:GID : 1000:1000
2025-05-07T20:23:55.6679626Z              netrc file : None
2025-05-07T20:23:55.6679888Z            offline mode : False
2025-05-07T20:23:55.6680055Z 
2025-05-07T20:23:55.7322020Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:23:55.7322756Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_589e50b1-f869-4197-9b6e-dcb1911e9ee8 ...
2025-05-07T20:23:55.7324338Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
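[EXAMPLE] The add_path_* file referenced above is GitHub Actions' "path" file command: any line appended to the file named by $GITHUB_PATH is prepended to PATH for all subsequent steps. A minimal sketch of what such an export typically looks like (the exact logic inside setup_env.bash is not shown in this log):

    # Make the Miniconda binaries visible to later workflow steps
    echo "$HOME/miniconda/bin" >> "$GITHUB_PATH"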
2025-05-07T20:23:55.7404194Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.7404688Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.9
2025-05-07T20:23:55.7421226Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:55.7421568Z env:
2025-05-07T20:23:55.7421779Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:55.7422096Z   BUILD_ENV: build_binary
2025-05-07T20:23:55.7422347Z   BUILD_TARGET: genai
2025-05-07T20:23:55.7422578Z   BUILD_VARIANT: cuda
2025-05-07T20:23:55.7422805Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:55.7423061Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:55.7423364Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:55.7423687Z ##[endgroup]
2025-05-07T20:23:56.0804215Z ################################################################################
2025-05-07T20:23:56.0804604Z # Create Conda Environment
2025-05-07T20:23:56.0804844Z #
2025-05-07T20:23:56.0820809Z # [2025-05-07T20:23:56.081Z] + create_conda_environment build_binary 3.9
2025-05-07T20:23:56.0821281Z ################################################################################
2025-05-07T20:23:56.0821542Z 
2025-05-07T20:23:56.0837693Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.1753756Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.1754195Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:23:56.1754561Z + conda info --envs
2025-05-07T20:23:56.1754766Z 
2025-05-07T20:23:56.9546703Z # conda environments:
2025-05-07T20:23:56.9546989Z #
2025-05-07T20:23:56.9547217Z base                   /home/ec2-user/miniconda
2025-05-07T20:23:56.9547445Z 
2025-05-07T20:23:57.0195148Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:23:58.6480869Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:23:58.6481150Z 
2025-05-07T20:23:58.6505879Z [SETUP] Creating new Conda environment (Python 3.9) ...
2025-05-07T20:23:58.6528496Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.9
2025-05-07T20:23:59.4306612Z Channels:
2025-05-07T20:23:59.4306849Z  - defaults
2025-05-07T20:23:59.4307059Z Platform: linux-64
2025-05-07T20:24:00.8902186Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:00.9907899Z Solving environment: ...working... done
2025-05-07T20:24:01.0201146Z 
2025-05-07T20:24:01.0201613Z ## Package Plan ##
2025-05-07T20:24:01.0201807Z 
2025-05-07T20:24:01.0202046Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:01.0202419Z 
2025-05-07T20:24:01.0202530Z   added / updated specs:
2025-05-07T20:24:01.0202820Z     - python=3.9
2025-05-07T20:24:01.0202962Z 
2025-05-07T20:24:01.0203088Z The following packages will be downloaded:
2025-05-07T20:24:01.0203304Z 
2025-05-07T20:24:01.0203458Z     package                    |            build
2025-05-07T20:24:01.0204032Z     ---------------------------|-----------------
2025-05-07T20:24:01.0204397Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:01.0204798Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:01.0205339Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:01.0206259Z     python-3.9.21              |       he870216_1        25.1 MB
2025-05-07T20:24:01.0206657Z     setuptools-78.1.1          |   py39h06a4308_0         1.7 MB
2025-05-07T20:24:01.0207056Z     wheel-0.45.1               |   py39h06a4308_0         114 KB
2025-05-07T20:24:01.0207412Z     ------------------------------------------------------------
2025-05-07T20:24:01.0207746Z                                            Total:        27.1 MB
2025-05-07T20:24:01.0207952Z 
2025-05-07T20:24:01.0208085Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:01.0208303Z 
2025-05-07T20:24:01.0208737Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:01.0209180Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:01.0209690Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:01.0210231Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:01.0210683Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:01.0211120Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.0211554Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:01.0212014Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:01.0212606Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:01.0213186Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:01.0213595Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:01.0214004Z   python             pkgs/main/linux-64::python-3.9.21-he870216_1
2025-05-07T20:24:01.0214424Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:01.0214891Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0
2025-05-07T20:24:01.0215351Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:01.0215747Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:01.0216121Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:01.0216532Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0
2025-05-07T20:24:01.0216921Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:01.0217283Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:01.0217536Z 
2025-05-07T20:24:01.0217691Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: python-3.9.21 (25.1 MB), setuptools-78.1.1, ca-certificates-2025.2.25, wheel-0.45.1, _openmp_mutex-5.1, and _libgcc_mutex-0.1 each downloaded to 100%]
2025-05-07T20:24:02.0337091Z done
2025-05-07T20:24:02.2442660Z Preparing transaction: ...working... done
2025-05-07T20:24:03.3854561Z Verifying transaction: ...working... done
2025-05-07T20:24:05.6031351Z Executing transaction: ...working... done
2025-05-07T20:24:05.6528335Z #
2025-05-07T20:24:05.6528572Z # To activate this environment, use
2025-05-07T20:24:05.6528871Z #
2025-05-07T20:24:05.6529081Z #     $ conda activate build_binary
2025-05-07T20:24:05.6529348Z #
2025-05-07T20:24:05.6529554Z # To deactivate an active environment, use
2025-05-07T20:24:05.6530120Z #
2025-05-07T20:24:05.6530311Z #     $ conda deactivate
2025-05-07T20:24:05.6530465Z 
2025-05-07T20:24:05.7565187Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:05.7586964Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:08.5698528Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (25.1)
2025-05-07T20:24:08.5699144Z Collecting pip
2025-05-07T20:24:08.5699468Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:08.5699890Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:08.5701108Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 102.3 MB/s eta 0:00:00
2025-05-07T20:24:08.5701493Z Installing collected packages: pip
2025-05-07T20:24:08.5701793Z   Attempting uninstall: pip
2025-05-07T20:24:08.5702075Z     Found existing installation: pip 25.1
2025-05-07T20:24:08.5702386Z     Uninstalling pip-25.1:
2025-05-07T20:24:08.5702688Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:08.5703006Z Successfully installed pip-25.1.1
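[EXAMPLE] Stripped of retries and logging, the environment bootstrap in this step reduces to two commands, taken verbatim from the log above:

    # Create the build environment with Python 3.9, then upgrade pip inside it
    conda create -y -n build_binary python=3.9
    conda run -n build_binary pip install --upgrade pip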
2025-05-07T20:24:08.6343055Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:08.6366710Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:09.5160748Z Channels:
2025-05-07T20:24:09.5160999Z  - conda-forge
2025-05-07T20:24:09.5161223Z Platform: linux-64
2025-05-07T20:24:19.9941195Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:21.5047694Z Solving environment: ...working... done
2025-05-07T20:24:21.5702603Z 
2025-05-07T20:24:21.5703171Z ## Package Plan ##
2025-05-07T20:24:21.5703603Z 
2025-05-07T20:24:21.5704411Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:21.5705011Z 
2025-05-07T20:24:21.5705198Z   added / updated specs:
2025-05-07T20:24:21.5705708Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:21.5706103Z 
2025-05-07T20:24:21.5706343Z The following packages will be downloaded:
2025-05-07T20:24:21.5706765Z 
2025-05-07T20:24:21.5706984Z     package                    |            build
2025-05-07T20:24:21.5707433Z     ---------------------------|-----------------
2025-05-07T20:24:21.5707806Z     cffi-1.17.1                |   py39h15c3d72_0         236 KB  conda-forge
2025-05-07T20:24:21.5708244Z     cryptography-44.0.3        |   py39h7170ec2_0         1.5 MB  conda-forge
2025-05-07T20:24:21.5708689Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:21.5709112Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:21.5709523Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:21.5710006Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:21.5710421Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:21.5710858Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:21.5711277Z     python_abi-3.9             |           2_cp39           4 KB  conda-forge
2025-05-07T20:24:21.5711723Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:21.5712306Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:21.5712868Z     ------------------------------------------------------------
2025-05-07T20:24:21.5713222Z                                            Total:         6.3 MB
2025-05-07T20:24:21.5713438Z 
2025-05-07T20:24:21.5713566Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:21.5713816Z 
2025-05-07T20:24:21.5714084Z   cffi               conda-forge/linux-64::cffi-1.17.1-py39h15c3d72_0
2025-05-07T20:24:21.5714771Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py39h7170ec2_0
2025-05-07T20:24:21.5715672Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:21.5716125Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:21.5716603Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:21.5717054Z   python_abi         conda-forge/linux-64::python_abi-3.9-2_cp39
2025-05-07T20:24:21.5717565Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:21.5718196Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:21.5718530Z 
2025-05-07T20:24:21.5718822Z The following packages will be UPDATED:
2025-05-07T20:24:21.5719027Z 
2025-05-07T20:24:21.5719665Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:21.5720515Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:21.5721166Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:21.5721792Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:21.5722149Z 
2025-05-07T20:24:21.5722309Z Downloading and Extracting Packages: ...working...
[progress-bar redraw output elided: all 11 packages downloaded to 100%]
2025-05-07T20:24:22.0288824Z done
2025-05-07T20:24:22.1292614Z Preparing transaction: ...working... done
2025-05-07T20:24:22.2298895Z Verifying transaction: ...working... done
2025-05-07T20:24:23.7321962Z Executing transaction: ...working... done
2025-05-07T20:24:23.9018952Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:25.6213877Z [CHECK] Python (sub-)package 'OpenSSL' found ...
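[EXAMPLE] The import test verifies that the upgraded pyOpenSSL is actually importable inside the target environment (pyOpenSSL installs as the module 'OpenSSL'). A minimal sketch of an equivalent check; the real helper in setup_env.bash is not shown in this log:

    # Exits non-zero if the OpenSSL module cannot be imported
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"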
2025-05-07T20:24:25.6227078Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:25.6250195Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:26.4935096Z Channels:
2025-05-07T20:24:26.4935400Z  - conda-forge
2025-05-07T20:24:26.4935695Z Platform: linux-64
2025-05-07T20:24:29.7664462Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:30.1334518Z Solving environment: ...working... done
2025-05-07T20:24:30.1941824Z 
2025-05-07T20:24:30.1942229Z ## Package Plan ##
2025-05-07T20:24:30.1942451Z 
2025-05-07T20:24:30.1942736Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.1943150Z 
2025-05-07T20:24:30.1943308Z   added / updated specs:
2025-05-07T20:24:30.1943595Z     - libxcrypt
2025-05-07T20:24:30.1943724Z 
2025-05-07T20:24:30.1943854Z The following packages will be downloaded:
2025-05-07T20:24:30.1944069Z 
2025-05-07T20:24:30.1944192Z     package                    |            build
2025-05-07T20:24:30.1944508Z     ---------------------------|-----------------
2025-05-07T20:24:30.1944883Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:30.1945285Z     ------------------------------------------------------------
2025-05-07T20:24:30.1945629Z                                            Total:          98 KB
2025-05-07T20:24:30.1945839Z 
2025-05-07T20:24:30.1945964Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:30.1946180Z 
2025-05-07T20:24:30.1946405Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:30.1946687Z 
2025-05-07T20:24:30.1946845Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.3649164Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:30.3652470Z done
2025-05-07T20:24:30.4655805Z Preparing transaction: ...working... done
2025-05-07T20:24:30.5662084Z Verifying transaction: ...working... done
2025-05-07T20:24:30.6669502Z Executing transaction: ...working... done
2025-05-07T20:24:34.0883840Z [SETUP] Copying over ...
2025-05-07T20:24:34.0884529Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.9/crypt.h
2025-05-07T20:24:34.0885064Z 
2025-05-07T20:24:35.7221700Z [SETUP] Installed Python version: Python 3.9.21
2025-05-07T20:24:35.7222156Z [SETUP] Successfully created Conda environment: build_binary
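[EXAMPLE] This step appears to work around CPython 3.9 headers that still reference crypt.h, which newer glibc-based hosts (glibc 2.34 here, with crypt split out into the separate libxcrypt project) may no longer ship by default: conda-forge's libxcrypt supplies the header, and the job copies it into the environment's Python include directory. A condensed sketch of the same workaround, with the prefix path taken from the log; the rationale above is an editorial inference, not stated in the log itself:

    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # Place crypt.h where the Python 3.9 headers expect to find it
    cp "$PREFIX/include/crypt.h" "$PREFIX/include/python3.9/crypt.h"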
2025-05-07T20:24:35.7256194Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7256697Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:24:35.7270600Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.7270952Z env:
2025-05-07T20:24:35.7271190Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.7271515Z   BUILD_ENV: build_binary
2025-05-07T20:24:35.7271757Z   BUILD_TARGET: genai
2025-05-07T20:24:35.7271987Z   BUILD_VARIANT: cuda
2025-05-07T20:24:35.7272226Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:35.7272473Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.7272774Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.7273109Z ##[endgroup]
2025-05-07T20:24:36.0621957Z ################################################################################
2025-05-07T20:24:36.0622706Z # Install C/C++ Compilers
2025-05-07T20:24:36.0622951Z #
2025-05-07T20:24:36.0646661Z # [2025-05-07T20:24:36.063Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:24:36.0647130Z ################################################################################
2025-05-07T20:24:36.0647427Z 
2025-05-07T20:24:36.0654221Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.1543762Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.1552502Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:24:36.1574976Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:24:37.0254470Z Channels:
2025-05-07T20:24:37.0255127Z  - conda-forge
2025-05-07T20:24:37.0255605Z Platform: linux-64
2025-05-07T20:24:40.3381534Z Collecting package metadata (repodata.json): ...working... done
2025-05-07T20:24:40.7048061Z Solving environment: ...working... done
2025-05-07T20:24:40.7659280Z 
2025-05-07T20:24:40.7659734Z ## Package Plan ##
2025-05-07T20:24:40.7660026Z 
2025-05-07T20:24:40.7660477Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:40.7661072Z 
2025-05-07T20:24:40.7661253Z   added / updated specs:
2025-05-07T20:24:40.7661665Z     - sysroot_linux-64=2.17
2025-05-07T20:24:40.7661855Z 
2025-05-07T20:24:40.7662010Z The following packages will be downloaded:
2025-05-07T20:24:40.7662223Z 
2025-05-07T20:24:40.7662335Z     package                        |            build
2025-05-07T20:24:40.7662649Z     -------------------------------|-----------------
2025-05-07T20:24:40.7663064Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:24:40.7663536Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:24:40.7663945Z     ------------------------------------------------------------
2025-05-07T20:24:40.7664281Z                                            Total:        15.4 MB
2025-05-07T20:24:40.7664495Z 
2025-05-07T20:24:40.7664635Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:40.7664853Z 
2025-05-07T20:24:40.7665131Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:24:40.7665683Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:24:40.7665989Z 
2025-05-07T20:24:40.7666135Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:40.7666509Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:40.7666728Z 2025-05-07T20:24:40.9750215Z kernel-headers_linux | 921 KB | | 0%  2025-05-07T20:24:40.9967854Z sysroot_linux-64-2.1 | 14.5 MB | | 0% 2025-05-07T20:24:40.9968212Z 2025-05-07T20:24:41.0124063Z kernel-headers_linux | 921 KB | 1 | 2%  2025-05-07T20:24:41.0124322Z 2025-05-07T20:24:41.0750732Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.1114597Z sysroot_linux-64-2.1 | 14.5 MB | #########3 | 93% 2025-05-07T20:24:41.2756071Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.2756320Z 2025-05-07T20:24:41.2757118Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.2757552Z 2025-05-07T20:24:41.7138586Z kernel-headers_linux | 921 KB | ########## | 100%  2025-05-07T20:24:41.7142704Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100% 2025-05-07T20:24:41.7143229Z 2025-05-07T20:24:41.7143514Z 2025-05-07T20:24:41.7143798Z  done 2025-05-07T20:24:41.8149050Z Preparing transaction: / done 2025-05-07T20:24:42.0156405Z Verifying transaction: \ | done 2025-05-07T20:24:42.2231414Z Executing transaction: - \ done 2025-05-07T20:24:42.3753437Z [CHECK] LD_LIBRARY_PATH = 2025-05-07T20:24:42.3753729Z [CHECK] CONDA_PREFIX is not set. 2025-05-07T20:24:44.0769907Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 2025-05-07T20:24:44.0783124Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ... 2025-05-07T20:24:44.0807032Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0 2025-05-07T20:24:44.9718367Z Channels: 2025-05-07T20:24:44.9718593Z - conda-forge 2025-05-07T20:24:44.9718820Z Platform: linux-64 2025-05-07T20:24:48.2311424Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:49.1870806Z Solving environment: \ | / done 2025-05-07T20:24:49.2506786Z 2025-05-07T20:24:49.2506947Z ## Package Plan ## 2025-05-07T20:24:49.2507098Z 2025-05-07T20:24:49.2507314Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:49.2507610Z 2025-05-07T20:24:49.2507715Z added / updated specs: 2025-05-07T20:24:49.2507969Z - gxx_linux-64=11.4.0 2025-05-07T20:24:49.2508163Z 2025-05-07T20:24:49.2508167Z 2025-05-07T20:24:49.2508283Z The following packages will be downloaded: 2025-05-07T20:24:49.2508504Z 2025-05-07T20:24:49.2508624Z package | build 2025-05-07T20:24:49.2508956Z ---------------------------|----------------- 2025-05-07T20:24:49.2509349Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge 2025-05-07T20:24:49.2509915Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge 2025-05-07T20:24:49.2510372Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge 2025-05-07T20:24:49.2510801Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge 2025-05-07T20:24:49.2511235Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge 2025-05-07T20:24:49.2511664Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge 2025-05-07T20:24:49.2512085Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge 2025-05-07T20:24:49.2512548Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge 2025-05-07T20:24:49.2513012Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge 2025-05-07T20:24:49.2513442Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T20:24:49.2513934Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge 2025-05-07T20:24:49.2514415Z libstdcxx-ng-15.1.0 | h4852527_2 
34 KB conda-forge 2025-05-07T20:24:49.2514809Z ------------------------------------------------------------ 2025-05-07T20:24:49.2515141Z Total: 91.6 MB 2025-05-07T20:24:49.2515343Z 2025-05-07T20:24:49.2515472Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:49.2515691Z 2025-05-07T20:24:49.2515954Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:24:49.2516506Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:24:49.2517389Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:24:49.2517889Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:24:49.2518385Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:24:49.2518874Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:24:49.2519388Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.2519925Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:24:49.2520406Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:24:49.2520933Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:24:49.2521286Z 2025-05-07T20:24:49.2521544Z The following packages will be UPDATED: 2025-05-07T20:24:49.2521744Z 2025-05-07T20:24:49.2522054Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:24:49.2522755Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:24:49.2523157Z 2025-05-07T20:24:49.2523161Z 2025-05-07T20:24:49.2523165Z 2025-05-07T20:24:49.2523304Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:24:49.2523304Z Downloading and Extracting Packages: ...working...
[progress-bar output elided: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng (34 KB), gcc_linux-64 (31 KB), gxx_linux-64 (29 KB), binutils_linux-64 (28 KB) -- all 12 downloads reached 100%]
2025-05-07T20:24:51.6356709Z done
2025-05-07T20:24:51.7358331Z Preparing transaction: done
2025-05-07T20:24:52.0366707Z Verifying transaction: done
2025-05-07T20:24:52.1377103Z Executing transaction: done
2025-05-07T20:24:52.2991185Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:24:56.1819109Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:56.1850955Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:24:56.1880209Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:24:56.1910809Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:24:58.0841490Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:24:58.1473154Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:00.0280380Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:00.0903544Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:01.9653904Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:02.0268756Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:03.9045444Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:03.9678623Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:03.9682406Z [INFO] Printing out all preprocessor defines in the C compiler ...
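The dump that follows is straightforward to reproduce outside the CI job: -dM asks the preprocessor to emit every macro it defines, -E stops after preprocessing, and the trailing "-" reads the (empty) translation unit from stdin. A minimal sketch, assuming a local build_binary environment like the one set up above:

    # Dump all predefined C macros, then pick out one of interest;
    # __GNUC__ should print 11 for the gcc 11.4.0 toolchain installed above.
    echo "" | conda run -n build_binary cc -dM -E -
    echo "" | conda run -n build_binary cc -dM -E - | grep -w __GNUC__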
2025-05-07T20:25:03.9683135Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:03.9683372Z 2025-05-07T20:25:05.8486761Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8487343Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:05.8488030Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:05.8489048Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8489623Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:05.8490310Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:05.8490774Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:05.8491403Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:05.8491927Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:05.8492322Z #define __CHAR_BIT__ 8 2025-05-07T20:25:05.8492773Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:05.8493316Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:05.8494145Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:05.8494494Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:05.8494933Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:05.8495347Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8495685Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:05.8496117Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:05.8496548Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:05.8496906Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:05.8497450Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:05.8497963Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:05.8498387Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:05.8498732Z #define __GCC_IEC_559 2 2025-05-07T20:25:05.8499084Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8499477Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:05.8499811Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8500207Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:05.8500659Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8501043Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:05.8501418Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8501817Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:05.8502207Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:05.8502525Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:05.8502910Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:05.8503294Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:05.8503608Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:05.8504590Z #define __INT8_C(c) c 2025-05-07T20:25:05.8504979Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:05.8505345Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8505796Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:05.8506240Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.8506681Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8507066Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.8507489Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8507882Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:05.8508281Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:05.8508786Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:05.8509291Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:05.8509797Z #define __linux 1 2025-05-07T20:25:05.8510142Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:05.8510517Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:05.8510968Z #define __unix 1 2025-05-07T20:25:05.8511251Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:05.8511646Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8512054Z #define __WINT_MIN__ 0U 2025-05-07T20:25:05.8512354Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8512748Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:05.8513161Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:05.8513478Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:05.8514064Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:05.8514498Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:05.8514848Z #define __INT64_C(c) c ## L 2025-05-07T20:25:05.8515223Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:05.8515652Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:05.8516024Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:05.8516427Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:05.8516944Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:05.8517306Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:05.8517619Z #define __DBL_DIG__ 15 2025-05-07T20:25:05.8517985Z #define __FLT32_DIG__ 6 2025-05-07T20:25:05.8518401Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:05.8518805Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:05.8519198Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:05.8519773Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:05.8520188Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:05.8520591Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:05.8520941Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:05.8521391Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:05.8521929Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:05.8522290Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:05.8522619Z #define __unix__ 1 2025-05-07T20:25:05.8523041Z #define __INT_WIDTH__ 32 2025-05-07T20:25:05.8523335Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:05.8523679Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:05.8524094Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:05.8524415Z #define __UINT16_C(c) c 2025-05-07T20:25:05.8524745Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:05.8525150Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:05.8525559Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:05.8526025Z #define __gnu_linux__ 1 2025-05-07T20:25:05.8526421Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:05.8526756Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8527145Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8527430Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:05.8527721Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:05.8527977Z #define __GNUC__ 11 2025-05-07T20:25:05.8528188Z #define __pie__ 2 2025-05-07T20:25:05.8528405Z #define __MMX__ 1 2025-05-07T20:25:05.8528628Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:05.8528887Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:05.8529168Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:05.8529436Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:05.8529782Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.8530175Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8530492Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.8530761Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:05.8531019Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:05.8531325Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:05.8531671Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:05.8531930Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:05.8532212Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:05.8532501Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:05.8532765Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:05.8543569Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:05.8543842Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:05.8544103Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:05.8544369Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:05.8544624Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:05.8544873Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:05.8545186Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.8545539Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:05.8545817Z #define __SSE2_MATH__ 1 2025-05-07T20:25:05.8546055Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:05.8546569Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8546857Z #define __amd64 1 2025-05-07T20:25:05.8547072Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:05.8547334Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:05.8547633Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:05.8547934Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:05.8548184Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8548455Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:05.8548699Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:05.8548961Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:05.8549222Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:05.8549481Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:05.8549881Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:05.8550174Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:05.8550417Z #define __x86_64 1 2025-05-07T20:25:05.8550758Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:05.8551136Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:05.8551612Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:05.8552062Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:05.8552538Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.8552933Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:05.8553184Z #define __LP64__ 1 2025-05-07T20:25:05.8553426Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8553773Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:05.8554151Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:05.8554426Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8554698Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8554978Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:05.8555254Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:05.8555522Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:05.8555783Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:05.8556046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:05.8556302Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:05.8556636Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:05.8556999Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:05.8557276Z #define __FLT_DIG__ 6 2025-05-07T20:25:05.8557503Z #define __NO_INLINE__ 1 2025-05-07T20:25:05.8557746Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:05.8558072Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:05.8558415Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:05.8558671Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:05.8558933Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:05.8559184Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:05.8559444Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:05.8559705Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:05.8559998Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:05.8560286Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:05.8560566Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:05.8560868Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.8561198Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:05.8561464Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:05.8561726Z #define __FLT128_DIG__ 33 2025-05-07T20:25:05.8561960Z #define __INT32_C(c) c 2025-05-07T20:25:05.8562205Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:05.8562487Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:05.8562765Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:05.8563047Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:05.8563364Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:05.8563667Z #define unix 1 2025-05-07T20:25:05.8563898Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:05.8564211Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8564516Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:05.8564829Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:05.8565271Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:05.8565522Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:05.8565785Z #define __ELF__ 1 2025-05-07T20:25:05.8566017Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:05.8566293Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:05.8566569Z #define __FLT_RADIX__ 2 2025-05-07T20:25:05.8566817Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:05.8567179Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:05.8567537Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:05.8567792Z #define __SSE_MATH__ 1 2025-05-07T20:25:05.8568020Z #define __k8 1 2025-05-07T20:25:05.8568312Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:05.8568687Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:05.8568986Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:05.8569369Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:05.8569628Z #define __LDBL_DIG__ 18 2025-05-07T20:25:05.8569881Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:05.8570135Z #define __x86_64__ 1 2025-05-07T20:25:05.8570375Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:05.8570677Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:05.8571018Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8571320Z #define __FLT64_DIG__ 15 2025-05-07T20:25:05.8571605Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8571955Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.8572263Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8572532Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:05.8572810Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8573100Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:05.8573470Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:05.8573870Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:05.8574159Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:05.8574500Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:05.8574824Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:05.8575122Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:05.8575398Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:05.8575709Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:05.8575998Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:05.8576231Z #define __SEG_FS 1 2025-05-07T20:25:05.8576464Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:05.8576742Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:05.8577013Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8577304Z #define __SEG_GS 1 2025-05-07T20:25:05.8577617Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:05.8577991Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:05.8578268Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:05.8578558Z #define __INT16_TYPE__ short int 2025-05-07T20:25:05.8578836Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:05.8579128Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:05.8579394Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:05.8579640Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:05.8579894Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:05.8580238Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.8580622Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8580904Z #define linux 1 2025-05-07T20:25:05.8581132Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8581409Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.8581676Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:05.8581928Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:05.8582190Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:05.8582446Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:05.8582790Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:05.8583204Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:05.8583634Z #define __code_model_small__ 1 2025-05-07T20:25:05.8583904Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:05.8584190Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:05.8584435Z #define __k8__ 1 2025-05-07T20:25:05.8584657Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:05.8584945Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:05.8585242Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:05.8585482Z #define __pic__ 2 2025-05-07T20:25:05.8585735Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8586046Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:05.8586333Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8586663Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:05.8587033Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.8587532Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:05.8587837Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:05.8588139Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:05.8588452Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:05.8588699Z #define __linux__ 1 2025-05-07T20:25:05.8588929Z #define __INT64_TYPE__ long int 2025-05-07T20:25:05.8589197Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:05.8589452Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:05.8589869Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:05.8590133Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:05.8590424Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8590753Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:05.8591054Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:05.8591319Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:05.8591612Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:05.8591911Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:05.8592238Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:05.8592603Z #define __SSE__ 1 2025-05-07T20:25:05.8592835Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:05.8593181Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.8593518Z #define __amd64__ 1 2025-05-07T20:25:05.8593743Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:05.8593993Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:05.8594256Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:05.8594528Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:05.8594799Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:05.8595064Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:05.8595325Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:05.8595601Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:05.8595863Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:05.8596221Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:05.8596687Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:05.8597046Z #define _LP64 1 2025-05-07T20:25:05.8597256Z #define __UINT8_C(c) c 2025-05-07T20:25:05.8597515Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:05.8597815Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:05.8598079Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:05.8598355Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:05.8598659Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:05.8599015Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:05.8599481Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:05.8599857Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8600145Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:05.8600463Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:05.8600830Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:05.8601197Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:05.8601466Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:05.8601807Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:05.8602289Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:05.8602549Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:05.8602806Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:05.8603058Z #define __FXSR__ 1 2025-05-07T20:25:05.8603357Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.8604257Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:05.8604680Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:05.8605002Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:05.8605257Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:05.8605589Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:05.8605941Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:05.8606188Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:05.8606431Z #define __PIC__ 2 2025-05-07T20:25:05.8606906Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:05.8607306Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:05.8607689Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:05.8608060Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:05.8608397Z #define __SSE2__ 1 2025-05-07T20:25:05.8608615Z #define __INT32_TYPE__ int 2025-05-07T20:25:05.8608852Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:05.8609103Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:05.8609434Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:05.8609774Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:05.8610045Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:05.8610311Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:05.8610580Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8610847Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:05.8611091Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:05.8611337Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:05.8611614Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8611911Z #define __PIE__ 2 2025-05-07T20:25:05.8612227Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:05.8612605Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:05.8612949Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:05.8613307Z #define __INT16_C(c) c 2025-05-07T20:25:05.8613520Z #define __STDC__ 1 2025-05-07T20:25:05.8613748Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:05.8614016Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:05.8614270Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:05.8614561Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:05.8614905Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:05.8615231Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:05.8615488Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:05.8615767Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:05.8616029Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:05.8616306Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:05.8616597Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:05.8616868Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:05.8617159Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:05.8617555Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:05.8617929Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:05.8618229Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:05.8618514Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:05.8618762Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:05.8618919Z 2025-05-07T20:25:05.9122212Z 2025-05-07T20:25:05.9122802Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
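The same trick works for the C++ front end; the extra "-x c++" in the command below forces C++ mode, which is why C++-only macros such as __cplusplus and the __cpp_* feature-test macros appear in this second dump. For instance, the default language standard can be checked with (again a sketch, assuming a local build_binary env):

    # __cplusplus prints 201703L in this log, i.e. the toolchain defaults to C++17.
    echo "" | conda run -n build_binary c++ -dM -E -x c++ - | grep -w __cplusplus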
2025-05-07T20:25:05.9123425Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:05.9123733Z 2025-05-07T20:25:07.8030701Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8031099Z #define __cpp_attributes 200809L 2025-05-07T20:25:07.8031736Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:07.8032188Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:07.8032568Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:07.8032828Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8033159Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:07.8033506Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:07.8033776Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:07.8034084Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:07.8034389Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:07.8034650Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:07.8034893Z #define __CHAR_BIT__ 8 2025-05-07T20:25:07.8035129Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:07.8035375Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:07.8035619Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:07.8035889Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:07.8036328Z #define __cpp_static_assert 201411L 2025-05-07T20:25:07.8036612Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:07.8036910Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8037209Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:07.8037490Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:07.8037816Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:07.8038184Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:07.8038572Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:07.8038980Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:07.8039287Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:07.8039565Z #define __GCC_IEC_559 2 2025-05-07T20:25:07.8039802Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8040077Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:07.8040352Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:07.8040637Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8040928Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:07.8041426Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:07.8041730Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:07.8042061Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8042382Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:07.8042646Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8042926Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:07.8043204Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:07.8043499Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:07.8043756Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:07.8044013Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:07.8044284Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:07.8044605Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:07.8044929Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:07.8045178Z #define __INT8_C(c) c 2025-05-07T20:25:07.8045412Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:07.8045681Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:07.8046003Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8046317Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:07.8046586Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:07.8046873Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:07.8047178Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.8047528Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8047807Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:07.8048083Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.8048339Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8048613Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:07.8048885Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:07.8049266Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:07.8049673Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:07.8050112Z #define __linux 1 2025-05-07T20:25:07.8050334Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:07.8050711Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:07.8050993Z #define __unix 1 2025-05-07T20:25:07.8051212Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:07.8051491Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:07.8051775Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:07.8052041Z #define __WINT_MIN__ 0U 2025-05-07T20:25:07.8052277Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8052557Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:07.8052832Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:07.8053094Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:07.8053345Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:07.8053624Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:07.8053912Z #define __INT64_C(c) c ## L 2025-05-07T20:25:07.8054179Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:07.8054557Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:07.8054823Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:07.8055130Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:07.8055404Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:07.8055659Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:07.8056007Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:07.8056379Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:07.8056634Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:07.8056901Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:07.8057176Z #define __DBL_DIG__ 15 2025-05-07T20:25:07.8057410Z #define __FLT32_DIG__ 6 2025-05-07T20:25:07.8057706Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:07.8058046Z #define __GXX_WEAK__ 1 2025-05-07T20:25:07.8058278Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:07.8058591Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:07.8058912Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:07.8059261Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:07.8059515Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:07.8059813Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:07.8060138Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:07.8060539Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:07.8060923Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:07.8061199Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:07.8061451Z #define __unix__ 1 2025-05-07T20:25:07.8061666Z #define __INT_WIDTH__ 32 2025-05-07T20:25:07.8061907Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:07.8062151Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:07.8062394Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:07.8062657Z #define __UINT16_C(c) c 2025-05-07T20:25:07.8062895Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:07.8063139Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:07.8063489Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:07.8063851Z #define __gnu_linux__ 1 2025-05-07T20:25:07.8064092Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:07.8064345Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8064638Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8073141Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8073430Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:07.8073690Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:07.8073945Z #define __GNUC__ 11 2025-05-07T20:25:07.8074163Z #define __GXX_RTTI 1 2025-05-07T20:25:07.8074386Z #define __pie__ 2 2025-05-07T20:25:07.8074601Z #define __MMX__ 1 2025-05-07T20:25:07.8074829Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:07.8075095Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:07.8075381Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:07.8075650Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:07.8075899Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:07.8076204Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:07.8076537Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:07.8077012Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.8077385Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:07.8077690Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8078007Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.8078314Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:07.8078592Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:07.8078892Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:07.8079187Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:07.8079459Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:07.8079719Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:07.8079996Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:07.8080288Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:07.8080558Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:07.8080830Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:07.8081196Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:07.8081456Z #define __cplusplus 201703L 2025-05-07T20:25:07.8081729Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:07.8082011Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:07.8082257Z #define __DEPRECATED 1 2025-05-07T20:25:07.8082509Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:07.8082801Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:07.8083050Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:07.8083366Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.8083722Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:07.8083985Z #define __SSE2_MATH__ 1 2025-05-07T20:25:07.8084229Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:07.8084529Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8084814Z #define __amd64 1 2025-05-07T20:25:07.8085035Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:07.8085304Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:07.8085565Z #define __GNUG__ 11 2025-05-07T20:25:07.8085823Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:07.8086131Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:07.8086387Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:07.8086640Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:07.8086916Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:07.8087169Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:07.8087437Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:07.8087729Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:07.8087991Z #define __cpp_hex_float 201603L 2025-05-07T20:25:07.8088251Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:07.8088513Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:07.8088789Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:07.8089055Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:07.8089321Z #define __x86_64 1 2025-05-07T20:25:07.8089550Z #define __cpp_lambdas 200907L 2025-05-07T20:25:07.8089813Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:07.8090186Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:07.8090574Z #define __cpp_template_auto 201606L 2025-05-07T20:25:07.8090933Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:07.8091371Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:07.8091836Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.8092218Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:07.8092460Z #define __LP64__ 1 2025-05-07T20:25:07.8092688Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8093033Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:07.8093401Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:07.8093671Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8093954Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:07.8094219Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:07.8094484Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:07.8094748Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:07.8095009Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:07.8095442Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:07.8095799Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:07.8096071Z #define __FLT_DIG__ 6 2025-05-07T20:25:07.8096294Z #define __NO_INLINE__ 1 2025-05-07T20:25:07.8096532Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:07.8096855Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:07.8097191Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:07.8097443Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:07.8097703Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:07.8097976Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:07.8098276Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:07.8098569Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:07.8098816Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:07.8099107Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:07.8099464Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:07.8099726Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:07.8100030Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.8100366Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:07.8100650Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:07.8100910Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:07.8101164Z #define __FLT128_DIG__ 33 2025-05-07T20:25:07.8101400Z #define __INT32_C(c) c 2025-05-07T20:25:07.8101634Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:07.8101913Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:07.8102191Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:07.8102465Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:07.8102776Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:07.8103079Z #define unix 1 2025-05-07T20:25:07.8103292Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:07.8103552Z #define __cpp_rtti 199711L 2025-05-07T20:25:07.8104197Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:07.8104512Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8104811Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:07.8105110Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:07.8105428Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:07.8105668Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:07.8105947Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:07.8106223Z #define __ELF__ 1 2025-05-07T20:25:07.8106444Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:07.8106716Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:07.8106979Z #define __FLT_RADIX__ 2 2025-05-07T20:25:07.8107209Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:07.8107559Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:07.8107914Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:07.8108178Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:07.8108483Z #define __k8 1 2025-05-07T20:25:07.8108783Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:07.8109152Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:07.8109431Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:07.8109775Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:07.8110028Z #define __LDBL_DIG__ 18 2025-05-07T20:25:07.8110256Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:07.8110506Z #define __x86_64__ 1 2025-05-07T20:25:07.8110736Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:07.8111019Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:07.8111349Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8111645Z #define __FLT64_DIG__ 15 2025-05-07T20:25:07.8111913Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8112253Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.8112557Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8112815Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:07.8113082Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8113370Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:07.8113873Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:07.8114257Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:07.8114539Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:07.8114852Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:07.8115154Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:07.8115471Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:07.8115761Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:07.8116029Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:07.8116326Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:07.8116599Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:07.8116831Z #define __SEG_FS 1 2025-05-07T20:25:07.8117049Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:07.8117321Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:07.8117589Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8117987Z #define __SEG_GS 1 2025-05-07T20:25:07.8118297Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:07.8118666Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:07.8118930Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:07.8119211Z #define __INT16_TYPE__ short int 2025-05-07T20:25:07.8119479Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:07.8119771Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:07.8120062Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:07.8120302Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:07.8120548Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:07.8120881Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.8121253Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8121557Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:07.8121867Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:07.8122161Z #define linux 1 2025-05-07T20:25:07.8122381Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8122643Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.8122915Z #define __EXCEPTIONS 1 2025-05-07T20:25:07.8123156Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:07.8123404Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:07.8123666Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:07.8123946Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:07.8124278Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:07.8124657Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:07.8124997Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:07.8125317Z #define __code_model_small__ 1 2025-05-07T20:25:07.8125577Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:07.8125872Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:07.8126168Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:07.8126435Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:07.8126723Z #define __k8__ 1 2025-05-07T20:25:07.8126943Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:07.8127216Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:07.8127509Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:07.8127797Z #define __pic__ 2 2025-05-07T20:25:07.8128091Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8128471Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:07.8128794Z #define __cpp_decltype 200707L 2025-05-07T20:25:07.8129148Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8129548Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:07.8130003Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.8130403Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:07.8130692Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:07.8131003Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:07.8131292Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:07.8131541Z #define __linux__ 1 2025-05-07T20:25:07.8131769Z #define __INT64_TYPE__ long int 2025-05-07T20:25:07.8132026Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:07.8132367Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:07.8132631Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:07.8132899Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:07.8133207Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:07.8133501Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8133800Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:07.8134062Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:07.8134350Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:07.8134639Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:07.8134954Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:07.8135301Z #define __SSE__ 1 2025-05-07T20:25:07.8135523Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:07.8135852Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.8136266Z #define __amd64__ 1 2025-05-07T20:25:07.8136490Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:07.8136732Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:07.8137004Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:07.8137263Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:07.8137520Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:07.8137805Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:07.8138132Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:07.8138453Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:07.8138874Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:07.8139434Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:07.8139859Z #define _LP64 1 2025-05-07T20:25:07.8140067Z #define __UINT8_C(c) c 2025-05-07T20:25:07.8140301Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:07.8140569Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:07.8140825Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:07.8141090Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:07.8141446Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:07.8141896Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:07.8142260Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8142547Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:07.8142848Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:07.8143142Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:07.8143513Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:07.8143872Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:07.8144128Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:07.8144385Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:07.8144719Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:07.8145073Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:07.8145329Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:07.8145577Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:07.8145817Z #define __FXSR__ 1 2025-05-07T20:25:07.8146118Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.8146561Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:07.8146959Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:07.8147250Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:07.8147508Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:07.8147802Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:07.8148081Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:07.8148342Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:07.8148692Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:07.8149043Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:07.8149303Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:07.8149542Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:07.8149823Z #define __PIC__ 2 2025-05-07T20:25:07.8150072Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:07.8150566Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:07.8150942Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:07.8151260Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:07.8151597Z #define __cpp_constexpr 201603L 2025-05-07T20:25:07.8151849Z #define __SSE2__ 1 2025-05-07T20:25:07.8152067Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:07.8152342Z #define __INT32_TYPE__ int 2025-05-07T20:25:07.8152581Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:07.8152826Z #define __cpp_exceptions 199711L 2025-05-07T20:25:07.8153092Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:07.8153417Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:07.8153755Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:07.8154015Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:07.8154275Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:07.8154661Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8154925Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:07.8155171Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:07.8155414Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:07.8155691Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:07.8155972Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8156256Z #define __PIE__ 2 2025-05-07T20:25:07.8156562Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:07.8156961Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:07.8157261Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:07.8157588Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:07.8157962Z #define __INT16_C(c) c 2025-05-07T20:25:07.8158206Z #define __STDC__ 1 2025-05-07T20:25:07.8158410Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:07.8158656Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:07.8158926Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:07.8159167Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:07.8159459Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:07.8159794Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:07.8160117Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:07.8160367Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:07.8160645Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:07.8160911Z #define __SSE_MATH__ 1 2025-05-07T20:25:07.8161133Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:07.8161405Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:07.8161703Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:07.8161967Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:07.8162249Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:07.8162512Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:07.8162792Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:07.8163174Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:07.8163537Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:07.8163830Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:07.8164103Z #define _GNU_SOURCE 1 2025-05-07T20:25:07.8164339Z #define __cpp_init_captures 201304L 2025-05-07T20:25:07.8164605Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:07.8164837Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:07.8164994Z 2025-05-07T20:25:07.8650497Z 2025-05-07T20:25:07.8650828Z + conda run -n build_binary c++ --version 2025-05-07T20:25:07.8651102Z 2025-05-07T20:25:09.7510390Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:09.7510916Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:09.7511499Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:09.7512027Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:09.7512360Z 2025-05-07T20:25:09.7512364Z 2025-05-07T20:25:09.8126422Z 2025-05-07T20:25:09.8126884Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:09.8127904Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:09.8128224Z 2025-05-07T20:25:11.7575784Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:11.7578088Z 2025-05-07T20:25:11.7578665Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:11.7579249Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:11.7579578Z 2025-05-07T20:25:13.6972637Z #define __cplusplus 201703L 2025-05-07T20:25:13.6975271Z 2025-05-07T20:25:13.6976771Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:13.7022838Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:13.7023263Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:13.7035485Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:13.7035988Z env: 2025-05-07T20:25:13.7036212Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:13.7036508Z BUILD_ENV: build_binary 2025-05-07T20:25:13.7036754Z BUILD_TARGET: genai 2025-05-07T20:25:13.7036980Z BUILD_VARIANT: cuda 2025-05-07T20:25:13.7037210Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:13.7037456Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:13.7037754Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:13.7038086Z ##[endgroup] 2025-05-07T20:25:14.0357694Z ################################################################################ 2025-05-07T20:25:14.0358065Z # Install CUDA 2025-05-07T20:25:14.0358267Z # 2025-05-07T20:25:14.0374259Z # [2025-05-07T20:25:14.037Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:14.0374649Z ################################################################################ 2025-05-07T20:25:14.0374867Z 2025-05-07T20:25:14.0389523Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:14.1340916Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:14.1341853Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:14.1345248Z + conda clean --packages --tarball -y 2025-05-07T20:25:14.1345744Z 2025-05-07T20:25:14.8414031Z Will remove 32 (140.4 MB) tarball(s). 2025-05-07T20:25:14.8414366Z Will remove 6 (617 KB) package(s). 2025-05-07T20:25:14.9033599Z 2025-05-07T20:25:14.9043763Z + conda clean --all -y 2025-05-07T20:25:14.9044013Z 2025-05-07T20:25:15.5754008Z There are no unused tarball(s) to remove. 2025-05-07T20:25:15.5754344Z Will remove 1 index cache(s). 2025-05-07T20:25:15.5754638Z There are no unused package(s) to remove. 2025-05-07T20:25:15.5754946Z There are no tempfile(s) to remove. 2025-05-07T20:25:15.5755262Z There are no logfile(s) to remove. 2025-05-07T20:25:15.6370551Z 2025-05-07T20:25:15.6384825Z [INSTALL] Installing CUDA 12.6.3 ... 
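The "[EXEC] [ATTEMPT 0/3]" prefix on the network probe above (and on the conda install that follows) comes from a retry wrapper sourced from $PRELUDE (.github/scripts/setup_env.bash). A minimal sketch of such a wrapper, assuming a hypothetical name and a fixed three-attempt policy rather than the script's actual helper:

  # Sketch only: exec_with_retries is a hypothetical name; the real helper
  # lives in .github/scripts/setup_env.bash and may differ in detail.
  exec_with_retries () {
    local max_attempts=3 attempt
    for attempt in $(seq 0 $((max_attempts - 1))); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
      "$@" && return 0          # stop on first success
      sleep 2                   # brief pause before retrying
    done
    return 1                    # all attempts failed
  }

  # Usage, mirroring the network probe logged above:
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null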
2025-05-07T20:25:15.6408361Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:16.5506490Z Channels: 2025-05-07T20:25:16.5506732Z - conda-forge 2025-05-07T20:25:16.5506960Z Platform: linux-64 2025-05-07T20:25:27.0679357Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:28.1693859Z Solving environment: - \ | / done 2025-05-07T20:25:28.2426888Z 2025-05-07T20:25:28.2427697Z ## Package Plan ## 2025-05-07T20:25:28.2428124Z 2025-05-07T20:25:28.2428528Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:28.2429123Z 2025-05-07T20:25:28.2429310Z added / updated specs: 2025-05-07T20:25:28.2430095Z - cuda=12.6.3 2025-05-07T20:25:28.2430362Z 2025-05-07T20:25:28.2430405Z 2025-05-07T20:25:28.2430652Z The following packages will be downloaded: 2025-05-07T20:25:28.2431075Z 2025-05-07T20:25:28.2431291Z package | build 2025-05-07T20:25:28.2431910Z ---------------------------|----------------- 2025-05-07T20:25:28.2432650Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:28.2433057Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:28.2433459Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:28.2433861Z bzip2-1.0.8 | h4bc722e_7 247 KB conda-forge 2025-05-07T20:25:28.2434267Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:28.2434664Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:28.2435501Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:28.2435999Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2436479Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:28.2436948Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:28.2437575Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:28.2438025Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2438484Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2438974Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:28.2439472Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2439980Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:28.2440485Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:28.2440955Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:28.2441405Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:28.2441849Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:28.2442304Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:28.2442758Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:28.2443242Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:28.2443696Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:28.2444135Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2444608Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:28.2445061Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:28.2445486Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:28.2445937Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:28.2446405Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 
2025-05-07T20:25:28.2446852Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB conda-forge 2025-05-07T20:25:28.2447310Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:28.2447764Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:28.2448210Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:28.2448644Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:28.2449092Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:28.2449526Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:28.2449958Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:28.2450397Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:28.2450849Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:28.2451309Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:28.2451756Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:28.2452193Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:28.2452616Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:28.2453180Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:28.2453652Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:28.2454113Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.2454572Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:28.2455107Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:28.2455539Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.2455963Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:28.2456411Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:28.2456864Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:28.2457271Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:28.2457663Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:28.2458117Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:28.2458634Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:28.2459150Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:28.2459643Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:28.2460077Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:28.2460538Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.2461007Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:28.2461435Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:28.2461826Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.2462228Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:28.2462623Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:28.2462996Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:28.2463393Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:28.2463788Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:28.2464173Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:28.2464591Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:28.2465034Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:28.2465473Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:28.2465905Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 
2025-05-07T20:25:28.2466340Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 2025-05-07T20:25:28.2466783Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:28.2467225Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:28.2467668Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:28.2468117Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:28.2468573Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:28.2469027Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:28.2469483Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:28.2470020Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:25:28.2470605Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:28.2471098Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:28.2471610Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:28.2472203Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:28.2472705Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:28.2473247Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:28.2473734Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:28.2474185Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:28.2474644Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:28.2475133Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:28.2475612Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T20:25:28.2476063Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:28.2476555Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:28.2477097Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:28.2477630Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:28.2478169Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:28.2478694Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:28.2479211Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:28.2479696Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:28.2480168Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:28.2480662Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:28.2481146Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:28.2481621Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:28.2482078Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:28.2482561Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:28.2483065Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:28.2483590Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:28.2484050Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:28.2484494Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:28.2484993Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:28.2485490Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:28.2485916Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:28.2486353Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:28.2486862Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:28.2487362Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:28.2487845Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB 
conda-forge 2025-05-07T20:25:28.2488344Z python-3.9.18 |h0755675_1_cpython 22.7 MB conda-forge 2025-05-07T20:25:28.2488827Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:28.2489387Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:25:28.2489832Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:28.2490283Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:28.2490788Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:28.2491215Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:28.2491658Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:28.2492109Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:28.2492579Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:28.2493026Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:28.2493470Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:28.2493924Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:28.2494347Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:28.2494767Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:28.2495204Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:28.2495669Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:28.2496139Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:28.2496594Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.2497035Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:28.2497481Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:28.2497914Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:28.2498356Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:28.2498815Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:28.2499278Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:28.2499676Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:28.2500054Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:28.2500425Z ------------------------------------------------------------ 2025-05-07T20:25:28.2500758Z Total: 1.63 GB 2025-05-07T20:25:28.2500972Z 2025-05-07T20:25:28.2501098Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:28.2501319Z 2025-05-07T20:25:28.2501523Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:28.2501940Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:28.2502347Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:28.2502774Z bzip2 conda-forge/linux-64::bzip2-1.0.8-h4bc722e_7 2025-05-07T20:25:28.2503215Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:28.2503641Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:28.2504349Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:28.2505310Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2505923Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:28.2506463Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.2507006Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:28.2507749Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 
2025-05-07T20:25:28.2508269Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2508828Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2511735Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:28.2512354Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2512955Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2513506Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2514020Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.2514523Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:28.2515064Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2515590Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2516161Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:28.2516693Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:28.2517179Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:28.2517730Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:28.2518267Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:28.2518740Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:28.2519258Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:28.2519802Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:28.2520342Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.2520895Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:28.2521429Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2521943Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2522442Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:28.2522944Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2523440Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.2523932Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.2524423Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2524942Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:28.2525496Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:28.2526027Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:28.2526532Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:28.2527010Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2527521Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2528085Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:28.2528617Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:28.2529158Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:28.2529697Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:28.2530281Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2530751Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:28.2531272Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:28.2531879Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:28.2532324Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:28.2532728Z expat conda-forge/linux-64::expat-2.7.0-h5888daf_0 2025-05-07T20:25:28.2533275Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:28.2533865Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:28.2534455Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:28.2535018Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:28.2535522Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:28.2536008Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:28.2536494Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:28.2536960Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.2537369Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:28.2537786Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.2538203Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:28.2538579Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:28.2538977Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:28.2539388Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:28.2539791Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:28.2540230Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.2540721Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:28.2541211Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:28.2541700Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:28.2542183Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:28.2542731Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:28.2543412Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:28.2543917Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:28.2544435Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.2544966Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:28.2545547Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:28.2546081Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:28.2546720Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2 2025-05-07T20:25:28.2547184Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:28.2547657Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:28.2548162Z libfreetype6 
conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:28.2548667Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:28.2549147Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:28.2549606Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:28.2550285Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:28.2550702Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:28.2551122Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.2551655Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:28.2552101Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T20:25:28.2552513Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:28.2552973Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:28.2553495Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:28.2554021Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:28.2554554Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:28.2555077Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:28.2555577Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:28.2556053Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:28.2556644Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:28.2557220Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:28.2557673Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:28.2558098Z libuuid conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:28.2558517Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:28.2558972Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:28.2559452Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:28.2559894Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:28.2560314Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.2560722Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:28.2561199Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:28.2561676Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:28.2562049Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:28.2562443Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:28.2562928Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:28.2563409Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:28.2563870Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:28.2564358Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:28.2564844Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:28.2574908Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:28.2575459Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:28.2576003Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:28.2576548Z xcb-util-keysyms 
conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:28.2577220Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:28.2577852Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:28.2578357Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:28.2578878Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:28.2579528Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:28.2580086Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:28.2580555Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:28.2581101Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:28.2581769Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:28.2582369Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:28.2582874Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:28.2583388Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:28.2583892Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:28.2584378Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:28.2584920Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:28.2585488Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:28.2585935Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:28.2586180Z 2025-05-07T20:25:28.2586295Z The following packages will be UPDATED: 2025-05-07T20:25:28.2586503Z 2025-05-07T20:25:28.2586740Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:28.2587072Z 2025-05-07T20:25:28.2587286Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:28.2587594Z 2025-05-07T20:25:28.2587876Z python pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.18-h0755675_1_cpython 2025-05-07T20:25:28.2588544Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1 2025-05-07T20:25:28.2589190Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:28.2589516Z 2025-05-07T20:25:28.2589664Z Downloading and Extracting Packages: ...working...
[... interleaved download progress-bar redraws elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), and the remaining packages; downloads were still in progress at this point in the log ...]
| 443.1 MB | #8 | 19% 2025-05-07T20:25:30.9238055Z 2025-05-07T20:25:30.9238059Z 2025-05-07T20:25:30.9238063Z 2025-05-07T20:25:30.9238067Z 2025-05-07T20:25:30.9250941Z cuda-nsight-12.6.77 | 113.2 MB | #######8 | 79%  2025-05-07T20:25:30.9251284Z 2025-05-07T20:25:30.9251288Z 2025-05-07T20:25:30.9251292Z 2025-05-07T20:25:30.9368366Z libcusparse-12.5.4.2 | 118.6 MB | #######4 | 74%  2025-05-07T20:25:30.9369023Z 2025-05-07T20:25:30.9534186Z libcublas-12.6.4.1 | 256.2 MB | ###2 | 33%  2025-05-07T20:25:30.9534506Z 2025-05-07T20:25:30.9534510Z 2025-05-07T20:25:30.9853504Z libcufft-11.3.0.4 | 156.2 MB | ######3 | 64%  2025-05-07T20:25:31.0362300Z nsight-compute-2024. | 443.1 MB | #9 | 20% 2025-05-07T20:25:31.0362573Z 2025-05-07T20:25:31.0362579Z 2025-05-07T20:25:31.0364067Z 2025-05-07T20:25:31.0366269Z libcusparse-12.5.4.2 | 118.6 MB | #######7 | 77%  2025-05-07T20:25:31.0366551Z 2025-05-07T20:25:31.0366556Z 2025-05-07T20:25:31.0366560Z 2025-05-07T20:25:31.0366563Z 2025-05-07T20:25:31.0371758Z cuda-nsight-12.6.77 | 113.2 MB | ########1 | 82%  2025-05-07T20:25:31.0374108Z 2025-05-07T20:25:31.0534474Z libcublas-12.6.4.1 | 256.2 MB | ###4 | 34%  2025-05-07T20:25:31.0534742Z 2025-05-07T20:25:31.0534747Z 2025-05-07T20:25:31.0854130Z libcufft-11.3.0.4 | 156.2 MB | ######6 | 66%  2025-05-07T20:25:31.1388180Z nsight-compute-2024. | 443.1 MB | ## | 20% 2025-05-07T20:25:31.1388459Z 2025-05-07T20:25:31.1407675Z libcublas-12.6.4.1 | 256.2 MB | ###5 | 36%  2025-05-07T20:25:31.1407938Z 2025-05-07T20:25:31.1407942Z 2025-05-07T20:25:31.1407946Z 2025-05-07T20:25:31.1464667Z libcusparse-12.5.4.2 | 118.6 MB | #######9 | 80%  2025-05-07T20:25:31.1464950Z 2025-05-07T20:25:31.1464954Z 2025-05-07T20:25:31.1464958Z 2025-05-07T20:25:31.1464962Z 2025-05-07T20:25:31.1572133Z cuda-nsight-12.6.77 | 113.2 MB | ########4 | 85%  2025-05-07T20:25:31.1572493Z 2025-05-07T20:25:31.1572499Z 2025-05-07T20:25:31.1857325Z libcufft-11.3.0.4 | 156.2 MB | ######8 | 68%  2025-05-07T20:25:31.2393243Z nsight-compute-2024. | 443.1 MB | ##1 | 21% 2025-05-07T20:25:31.2393570Z 2025-05-07T20:25:31.2412652Z libcublas-12.6.4.1 | 256.2 MB | ###7 | 37%  2025-05-07T20:25:31.2412918Z 2025-05-07T20:25:31.2412923Z 2025-05-07T20:25:31.2414570Z 2025-05-07T20:25:31.2576224Z libcusparse-12.5.4.2 | 118.6 MB | ########3 | 83%  2025-05-07T20:25:31.2576518Z 2025-05-07T20:25:31.2576522Z 2025-05-07T20:25:31.2863361Z libcufft-11.3.0.4 | 156.2 MB | #######1 | 71%  2025-05-07T20:25:31.3097677Z nsight-compute-2024. | 443.1 MB | ##1 | 22% 2025-05-07T20:25:31.3098065Z 2025-05-07T20:25:31.3098071Z 2025-05-07T20:25:31.3098076Z 2025-05-07T20:25:31.3099847Z 2025-05-07T20:25:31.3447977Z cuda-nsight-12.6.77 | 113.2 MB | ########7 | 87%  2025-05-07T20:25:31.3448269Z 2025-05-07T20:25:31.3535355Z libcublas-12.6.4.1 | 256.2 MB | ###8 | 39%  2025-05-07T20:25:31.3535618Z 2025-05-07T20:25:31.3535622Z 2025-05-07T20:25:31.3538873Z 2025-05-07T20:25:31.3688087Z libcusparse-12.5.4.2 | 118.6 MB | ########6 | 86%  2025-05-07T20:25:31.3688448Z 2025-05-07T20:25:31.3691319Z 2025-05-07T20:25:31.3975106Z libcufft-11.3.0.4 | 156.2 MB | #######3 | 73%  2025-05-07T20:25:31.4100851Z nsight-compute-2024. 
| 443.1 MB | ##2 | 23% 2025-05-07T20:25:31.4101134Z 2025-05-07T20:25:31.4101138Z 2025-05-07T20:25:31.4101142Z 2025-05-07T20:25:31.4101784Z 2025-05-07T20:25:31.4467290Z cuda-nsight-12.6.77 | 113.2 MB | ######### | 90%  2025-05-07T20:25:31.4467676Z 2025-05-07T20:25:31.4563368Z libcublas-12.6.4.1 | 256.2 MB | #### | 40%  2025-05-07T20:25:31.4563632Z 2025-05-07T20:25:31.4563639Z 2025-05-07T20:25:31.4564896Z 2025-05-07T20:25:31.4730808Z libcusparse-12.5.4.2 | 118.6 MB | ########8 | 89%  2025-05-07T20:25:31.4731103Z 2025-05-07T20:25:31.4731107Z 2025-05-07T20:25:31.4976188Z libcufft-11.3.0.4 | 156.2 MB | #######5 | 76%  2025-05-07T20:25:31.5104124Z nsight-compute-2024. | 443.1 MB | ##3 | 24% 2025-05-07T20:25:31.5104465Z 2025-05-07T20:25:31.5104469Z 2025-05-07T20:25:31.5104473Z 2025-05-07T20:25:31.5105129Z 2025-05-07T20:25:31.5536793Z cuda-nsight-12.6.77 | 113.2 MB | #########3 | 93%  2025-05-07T20:25:31.5537094Z 2025-05-07T20:25:31.5590710Z libcublas-12.6.4.1 | 256.2 MB | ####1 | 41%  2025-05-07T20:25:31.5591047Z 2025-05-07T20:25:31.5591053Z 2025-05-07T20:25:31.5593249Z 2025-05-07T20:25:31.5844552Z libcusparse-12.5.4.2 | 118.6 MB | #########1 | 92%  2025-05-07T20:25:31.5844831Z 2025-05-07T20:25:31.5844835Z 2025-05-07T20:25:31.6103605Z libcufft-11.3.0.4 | 156.2 MB | #######7 | 78%  2025-05-07T20:25:31.6108615Z nsight-compute-2024. | 443.1 MB | ##4 | 24% 2025-05-07T20:25:31.6108927Z 2025-05-07T20:25:31.6108931Z 2025-05-07T20:25:31.6108935Z 2025-05-07T20:25:31.6110838Z 2025-05-07T20:25:31.6642321Z cuda-nsight-12.6.77 | 113.2 MB | #########5 | 96%  2025-05-07T20:25:31.6642704Z 2025-05-07T20:25:31.6692542Z libcublas-12.6.4.1 | 256.2 MB | ####2 | 43%  2025-05-07T20:25:31.6692866Z 2025-05-07T20:25:31.6692871Z 2025-05-07T20:25:31.6693622Z 2025-05-07T20:25:31.6859779Z libcusparse-12.5.4.2 | 118.6 MB | #########4 | 95%  2025-05-07T20:25:31.6860090Z 2025-05-07T20:25:31.6860096Z 2025-05-07T20:25:31.7119188Z libcufft-11.3.0.4 | 156.2 MB | ######## | 80%  2025-05-07T20:25:31.7119493Z 2025-05-07T20:25:31.7119499Z 2025-05-07T20:25:31.7119504Z 2025-05-07T20:25:31.7119509Z 2025-05-07T20:25:31.7157120Z cuda-nsight-12.6.77 | 113.2 MB | #########8 | 99%  2025-05-07T20:25:31.7683691Z nsight-compute-2024. | 443.1 MB | ##5 | 25% 2025-05-07T20:25:31.7683953Z 2025-05-07T20:25:31.7777283Z libcublas-12.6.4.1 | 256.2 MB | ####4 | 44%  2025-05-07T20:25:31.7777548Z 2025-05-07T20:25:31.7777552Z 2025-05-07T20:25:31.7777978Z 2025-05-07T20:25:31.7861391Z libcusparse-12.5.4.2 | 118.6 MB | #########7 | 97%  2025-05-07T20:25:31.7861689Z 2025-05-07T20:25:31.7863173Z 2025-05-07T20:25:31.8157631Z libcufft-11.3.0.4 | 156.2 MB | ########2 | 82%  2025-05-07T20:25:31.8688384Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:25:31.8689312Z 2025-05-07T20:25:31.8868248Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:25:31.8868539Z 2025-05-07T20:25:31.8870212Z 2025-05-07T20:25:31.9157775Z libcufft-11.3.0.4 | 156.2 MB | ########5 | 85%  2025-05-07T20:25:31.9689788Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:25:31.9692117Z 2025-05-07T20:25:31.9870784Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 47%  2025-05-07T20:25:31.9871114Z 2025-05-07T20:25:31.9871118Z 2025-05-07T20:25:32.0161897Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:25:32.0697633Z nsight-compute-2024. 
| 443.1 MB | ##7 | 28% 2025-05-07T20:25:32.0701749Z 2025-05-07T20:25:32.0873260Z libcublas-12.6.4.1 | 256.2 MB | ####8 | 49%  2025-05-07T20:25:32.0873564Z 2025-05-07T20:25:32.0874299Z 2025-05-07T20:25:32.1163389Z libcufft-11.3.0.4 | 156.2 MB | ######### | 91%  2025-05-07T20:25:32.1699187Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:25:32.1699557Z 2025-05-07T20:25:32.1877057Z libcublas-12.6.4.1 | 256.2 MB | ##### | 50%  2025-05-07T20:25:32.1877329Z 2025-05-07T20:25:32.1877333Z 2025-05-07T20:25:32.2166229Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:25:32.2703092Z nsight-compute-2024. | 443.1 MB | ##9 | 29% 2025-05-07T20:25:32.2705204Z 2025-05-07T20:25:32.2878593Z libcublas-12.6.4.1 | 256.2 MB | #####1 | 52%  2025-05-07T20:25:32.2878857Z 2025-05-07T20:25:32.2879525Z 2025-05-07T20:25:32.3169560Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:25:32.3704326Z nsight-compute-2024. | 443.1 MB | ### | 30% 2025-05-07T20:25:32.3705398Z 2025-05-07T20:25:32.3881753Z libcublas-12.6.4.1 | 256.2 MB | #####3 | 53%  2025-05-07T20:25:32.3882087Z 2025-05-07T20:25:32.3882633Z 2025-05-07T20:25:32.4170962Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 99%  2025-05-07T20:25:32.4705846Z nsight-compute-2024. | 443.1 MB | ###1 | 31% 2025-05-07T20:25:32.4706521Z 2025-05-07T20:25:32.5173988Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 55%  2025-05-07T20:25:32.5711048Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:25:32.5711397Z 2025-05-07T20:25:32.6174559Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 58%  2025-05-07T20:25:32.6710870Z nsight-compute-2024. | 443.1 MB | ###3 | 34% 2025-05-07T20:25:32.6711939Z 2025-05-07T20:25:32.7174531Z libcublas-12.6.4.1 | 256.2 MB | #####9 | 60%  2025-05-07T20:25:32.7945718Z nsight-compute-2024. | 443.1 MB | ###5 | 35% 2025-05-07T20:25:32.7946790Z 2025-05-07T20:25:32.8174322Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:25:32.9043036Z nsight-compute-2024. | 443.1 MB | ###6 | 37% 2025-05-07T20:25:32.9043366Z 2025-05-07T20:25:32.9176628Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:33.0043609Z nsight-compute-2024. | 443.1 MB | ###8 | 38% 2025-05-07T20:25:33.0044651Z 2025-05-07T20:25:33.0177845Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:25:33.1044700Z nsight-compute-2024. | 443.1 MB | ###9 | 39% 2025-05-07T20:25:33.1045075Z 2025-05-07T20:25:33.1180376Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:25:33.2045226Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:33.2045580Z 2025-05-07T20:25:33.2226818Z libcublas-12.6.4.1 | 256.2 MB | ######9 | 69%  2025-05-07T20:25:33.3050547Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:25:33.3050843Z 2025-05-07T20:25:33.3338508Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 72%  2025-05-07T20:25:33.4141210Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:25:33.4143509Z 2025-05-07T20:25:33.4484638Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:25:33.5146046Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:25:33.5146553Z 2025-05-07T20:25:33.5511782Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:25:33.6148256Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:25:33.6148632Z 2025-05-07T20:25:33.6566941Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:25:33.7563613Z nsight-compute-2024. 
| 443.1 MB | ####6 | 47% 2025-05-07T20:25:33.7565396Z 2025-05-07T20:25:33.7571941Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:25:33.8582736Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:25:33.8670518Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:25:33.8670813Z 2025-05-07T20:25:33.9499804Z libcublas-12.6.4.1 | 256.2 MB | ######## | 81%  2025-05-07T20:25:33.9500068Z 2025-05-07T20:25:33.9500089Z 2025-05-07T20:25:33.9500093Z 2025-05-07T20:25:33.9501332Z 2025-05-07T20:25:33.9673268Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:33.9675267Z 2025-05-07T20:25:33.9765522Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 83%  2025-05-07T20:25:33.9922400Z nsight-compute-2024. | 443.1 MB | #####1 | 51% 2025-05-07T20:25:33.9922752Z 2025-05-07T20:25:33.9922757Z 2025-05-07T20:25:33.9922763Z 2025-05-07T20:25:33.9922768Z 2025-05-07T20:25:33.9922774Z 2025-05-07T20:25:34.0710864Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:25:34.0711243Z 2025-05-07T20:25:34.0926521Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 84%  2025-05-07T20:25:34.0926831Z 2025-05-07T20:25:34.0926835Z 2025-05-07T20:25:34.0926839Z 2025-05-07T20:25:34.0926843Z 2025-05-07T20:25:34.0926961Z 2025-05-07T20:25:34.1149975Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:25:34.1921337Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:25:34.1921755Z 2025-05-07T20:25:34.1926625Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 86%  2025-05-07T20:25:34.1926988Z 2025-05-07T20:25:34.1926993Z 2025-05-07T20:25:34.1926997Z 2025-05-07T20:25:34.1927199Z 2025-05-07T20:25:34.1929843Z 2025-05-07T20:25:34.2575061Z cuda-nvvp-12.6.80 | 109.3 MB | 5 | 6%  2025-05-07T20:25:34.2927326Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:25:34.2927701Z 2025-05-07T20:25:34.2927708Z 2025-05-07T20:25:34.2927713Z 2025-05-07T20:25:34.2927729Z 2025-05-07T20:25:34.2930568Z 2025-05-07T20:25:34.3113732Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 9%  2025-05-07T20:25:34.3114149Z 2025-05-07T20:25:34.3145671Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:25:34.3145927Z 2025-05-07T20:25:34.3145931Z 2025-05-07T20:25:34.3145935Z 2025-05-07T20:25:34.3148654Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.3149009Z 2025-05-07T20:25:34.3149015Z 2025-05-07T20:25:34.3155304Z 2025-05-07T20:25:34.3668552Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:25:34.3668967Z 2025-05-07T20:25:34.3668973Z 2025-05-07T20:25:34.3668999Z 2025-05-07T20:25:34.3669002Z 2025-05-07T20:25:34.3669006Z 2025-05-07T20:25:34.3672615Z 2025-05-07T20:25:34.3929894Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:25:34.3930211Z 2025-05-07T20:25:34.3930217Z 2025-05-07T20:25:34.3930222Z 2025-05-07T20:25:34.3930227Z 2025-05-07T20:25:34.3930233Z 2025-05-07T20:25:34.3945412Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:25:34.4289798Z nsight-compute-2024. 
| 443.1 MB | #####4 | 54% 2025-05-07T20:25:34.4294286Z 2025-05-07T20:25:34.4669961Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:25:34.4670247Z 2025-05-07T20:25:34.4670251Z 2025-05-07T20:25:34.4670255Z 2025-05-07T20:25:34.4670278Z 2025-05-07T20:25:34.4670282Z 2025-05-07T20:25:34.4672634Z 2025-05-07T20:25:34.4969656Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:25:34.4970016Z 2025-05-07T20:25:34.4970022Z 2025-05-07T20:25:34.4970050Z 2025-05-07T20:25:34.4970055Z 2025-05-07T20:25:34.4972700Z 2025-05-07T20:25:34.5415022Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:25:34.5647056Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:25:34.5647461Z 2025-05-07T20:25:34.5672128Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:25:34.5672452Z 2025-05-07T20:25:34.5672468Z 2025-05-07T20:25:34.5672472Z 2025-05-07T20:25:34.5672476Z 2025-05-07T20:25:34.5672480Z 2025-05-07T20:25:34.5674259Z 2025-05-07T20:25:34.6048066Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 6%  2025-05-07T20:25:34.6048493Z 2025-05-07T20:25:34.6048498Z 2025-05-07T20:25:34.6048501Z 2025-05-07T20:25:34.6048526Z 2025-05-07T20:25:34.6051353Z 2025-05-07T20:25:34.6673876Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 18%  2025-05-07T20:25:34.6674228Z 2025-05-07T20:25:34.6674232Z 2025-05-07T20:25:34.6674236Z 2025-05-07T20:25:34.6674240Z 2025-05-07T20:25:34.6674264Z 2025-05-07T20:25:34.6677074Z 2025-05-07T20:25:34.6887230Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 9%  2025-05-07T20:25:34.6887701Z 2025-05-07T20:25:34.6889833Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:25:34.7079208Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:25:34.7079466Z 2025-05-07T20:25:34.7079470Z 2025-05-07T20:25:34.7079474Z 2025-05-07T20:25:34.7079477Z 2025-05-07T20:25:34.7083028Z 2025-05-07T20:25:34.7678738Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 21%  2025-05-07T20:25:34.7679065Z 2025-05-07T20:25:34.7679069Z 2025-05-07T20:25:34.7679073Z 2025-05-07T20:25:34.7679083Z 2025-05-07T20:25:34.7679087Z 2025-05-07T20:25:34.7681797Z 2025-05-07T20:25:34.8071676Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 12%  2025-05-07T20:25:34.8072138Z 2025-05-07T20:25:34.8084299Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 93%  2025-05-07T20:25:34.8084652Z 2025-05-07T20:25:34.8084924Z 2025-05-07T20:25:34.8084930Z 2025-05-07T20:25:34.8084935Z 2025-05-07T20:25:34.8088197Z 2025-05-07T20:25:34.8118840Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 23%  2025-05-07T20:25:34.8789259Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:25:34.8789608Z 2025-05-07T20:25:34.8789615Z 2025-05-07T20:25:34.8789620Z 2025-05-07T20:25:34.8789625Z 2025-05-07T20:25:34.8789631Z 2025-05-07T20:25:34.8789637Z 2025-05-07T20:25:34.9161347Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 14%  2025-05-07T20:25:34.9166881Z 2025-05-07T20:25:34.9215329Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 94%  2025-05-07T20:25:34.9215687Z 2025-05-07T20:25:34.9215706Z 2025-05-07T20:25:34.9215712Z 2025-05-07T20:25:34.9215718Z 2025-05-07T20:25:34.9215897Z 2025-05-07T20:25:34.9364716Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:25:34.9793310Z nsight-compute-2024. 
| 443.1 MB | #####7 | 58% 2025-05-07T20:25:34.9793671Z 2025-05-07T20:25:34.9793676Z 2025-05-07T20:25:34.9793682Z 2025-05-07T20:25:34.9793687Z 2025-05-07T20:25:34.9793693Z 2025-05-07T20:25:34.9793703Z 2025-05-07T20:25:35.0381893Z libcusolver-11.7.1.2 | 95.8 MB | #7 | 17%  2025-05-07T20:25:35.0382300Z 2025-05-07T20:25:35.0382306Z 2025-05-07T20:25:35.0382312Z 2025-05-07T20:25:35.0382317Z 2025-05-07T20:25:35.0387177Z 2025-05-07T20:25:35.0432256Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:25:35.0432638Z 2025-05-07T20:25:35.0765697Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 96%  2025-05-07T20:25:35.0818114Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:25:35.0818476Z 2025-05-07T20:25:35.0818483Z 2025-05-07T20:25:35.0818489Z 2025-05-07T20:25:35.0818494Z 2025-05-07T20:25:35.0818499Z 2025-05-07T20:25:35.0818504Z 2025-05-07T20:25:35.1468617Z libcusolver-11.7.1.2 | 95.8 MB | #9 | 20%  2025-05-07T20:25:35.1469024Z 2025-05-07T20:25:35.1469029Z 2025-05-07T20:25:35.1469034Z 2025-05-07T20:25:35.1469039Z 2025-05-07T20:25:35.1470637Z 2025-05-07T20:25:35.1615496Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 31%  2025-05-07T20:25:35.1622770Z 2025-05-07T20:25:35.1869391Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:25:35.1869761Z 2025-05-07T20:25:35.1869767Z 2025-05-07T20:25:35.1869855Z 2025-05-07T20:25:35.1869861Z 2025-05-07T20:25:35.1869866Z 2025-05-07T20:25:35.1873567Z 2025-05-07T20:25:35.1946738Z libcusolver-11.7.1.2 | 95.8 MB | ##2 | 22%  2025-05-07T20:25:35.2469497Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:25:35.2469973Z 2025-05-07T20:25:35.2469999Z 2025-05-07T20:25:35.2470006Z 2025-05-07T20:25:35.2470011Z 2025-05-07T20:25:35.2472073Z 2025-05-07T20:25:35.2662279Z cuda-nvvp-12.6.80 | 109.3 MB | ###3 | 34%  2025-05-07T20:25:35.2664702Z 2025-05-07T20:25:35.2876500Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:25:35.2876867Z 2025-05-07T20:25:35.2876872Z 2025-05-07T20:25:35.2876875Z 2025-05-07T20:25:35.2876879Z 2025-05-07T20:25:35.2876883Z 2025-05-07T20:25:35.2878752Z 2025-05-07T20:25:35.3004667Z libcusolver-11.7.1.2 | 95.8 MB | ##5 | 25%  2025-05-07T20:25:35.3472167Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:25:35.3472453Z 2025-05-07T20:25:35.3472457Z 2025-05-07T20:25:35.3472461Z 2025-05-07T20:25:35.3472465Z 2025-05-07T20:25:35.3474458Z 2025-05-07T20:25:35.3667533Z cuda-nvvp-12.6.80 | 109.3 MB | ###6 | 37%  2025-05-07T20:25:35.3669499Z 2025-05-07T20:25:35.3877971Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:25:35.3878331Z 2025-05-07T20:25:35.3878335Z 2025-05-07T20:25:35.3878345Z 2025-05-07T20:25:35.3878349Z 2025-05-07T20:25:35.3878352Z 2025-05-07T20:25:35.3883151Z 2025-05-07T20:25:35.4009542Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:25:35.4472920Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:25:35.4473232Z 2025-05-07T20:25:35.4473236Z 2025-05-07T20:25:35.4473248Z 2025-05-07T20:25:35.4473252Z 2025-05-07T20:25:35.4476437Z 2025-05-07T20:25:35.4692557Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 40%  2025-05-07T20:25:35.4692843Z 2025-05-07T20:25:35.4878578Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 100%  2025-05-07T20:25:35.4878882Z 2025-05-07T20:25:35.4878885Z 2025-05-07T20:25:35.4878889Z 2025-05-07T20:25:35.4878893Z 2025-05-07T20:25:35.4878896Z 2025-05-07T20:25:35.4880976Z 2025-05-07T20:25:35.5078407Z libcusolver-11.7.1.2 | 95.8 MB | ###1 | 31%  2025-05-07T20:25:35.5473852Z nsight-compute-2024. 
| 443.1 MB | ###### | 61% 2025-05-07T20:25:35.5474113Z 2025-05-07T20:25:35.5474118Z 2025-05-07T20:25:35.5474122Z 2025-05-07T20:25:35.5474125Z 2025-05-07T20:25:35.5476064Z 2025-05-07T20:25:35.5894918Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 42%  2025-05-07T20:25:35.5895270Z 2025-05-07T20:25:35.5895273Z 2025-05-07T20:25:35.5895277Z 2025-05-07T20:25:35.5895281Z 2025-05-07T20:25:35.5895284Z 2025-05-07T20:25:35.5897786Z 2025-05-07T20:25:35.6081234Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 34%  2025-05-07T20:25:35.6477467Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:25:35.6477752Z 2025-05-07T20:25:35.6477758Z 2025-05-07T20:25:35.6477763Z 2025-05-07T20:25:35.6477768Z 2025-05-07T20:25:35.6479374Z 2025-05-07T20:25:35.6906767Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:25:35.6907127Z 2025-05-07T20:25:35.6907133Z 2025-05-07T20:25:35.6907166Z 2025-05-07T20:25:35.6907172Z 2025-05-07T20:25:35.6907177Z 2025-05-07T20:25:35.6907182Z 2025-05-07T20:25:35.7084198Z libcusolver-11.7.1.2 | 95.8 MB | ###7 | 37%  2025-05-07T20:25:35.7478757Z nsight-compute-2024. | 443.1 MB | ######2 | 62% 2025-05-07T20:25:35.7479021Z 2025-05-07T20:25:35.7479025Z 2025-05-07T20:25:35.7479029Z 2025-05-07T20:25:35.7479032Z 2025-05-07T20:25:35.7480786Z 2025-05-07T20:25:35.7900282Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:25:35.7900557Z 2025-05-07T20:25:35.7900561Z 2025-05-07T20:25:35.7900565Z 2025-05-07T20:25:35.7900569Z 2025-05-07T20:25:35.7900572Z 2025-05-07T20:25:35.7902879Z 2025-05-07T20:25:35.8087076Z libcusolver-11.7.1.2 | 95.8 MB | #### | 40%  2025-05-07T20:25:35.8480249Z nsight-compute-2024. | 443.1 MB | ######2 | 63% 2025-05-07T20:25:35.8480494Z 2025-05-07T20:25:35.8480498Z 2025-05-07T20:25:35.8480502Z 2025-05-07T20:25:35.8480517Z 2025-05-07T20:25:35.8483880Z 2025-05-07T20:25:35.8925889Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:25:35.8926158Z 2025-05-07T20:25:35.8926162Z 2025-05-07T20:25:35.8926166Z 2025-05-07T20:25:35.8926170Z 2025-05-07T20:25:35.8926186Z 2025-05-07T20:25:35.8928144Z 2025-05-07T20:25:35.9089572Z libcusolver-11.7.1.2 | 95.8 MB | ####3 | 43%  2025-05-07T20:25:35.9533502Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:25:35.9533780Z 2025-05-07T20:25:35.9533785Z 2025-05-07T20:25:35.9533791Z 2025-05-07T20:25:35.9533796Z 2025-05-07T20:25:35.9536229Z 2025-05-07T20:25:35.9952414Z cuda-nvvp-12.6.80 | 109.3 MB | #####5 | 55%  2025-05-07T20:25:35.9952700Z 2025-05-07T20:25:35.9952704Z 2025-05-07T20:25:35.9952708Z 2025-05-07T20:25:35.9952719Z 2025-05-07T20:25:35.9952722Z 2025-05-07T20:25:35.9958699Z 2025-05-07T20:25:36.0153449Z libcusolver-11.7.1.2 | 95.8 MB | ####6 | 46%  2025-05-07T20:25:36.0153809Z 2025-05-07T20:25:36.0153824Z 2025-05-07T20:25:36.0201914Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:25:36.0564375Z nsight-compute-2024. 
| 443.1 MB | ######4 | 64% 2025-05-07T20:25:36.0564682Z 2025-05-07T20:25:36.0564935Z 2025-05-07T20:25:36.0564941Z 2025-05-07T20:25:36.0564945Z 2025-05-07T20:25:36.0564949Z 2025-05-07T20:25:36.0564952Z 2025-05-07T20:25:36.0566405Z 2025-05-07T20:25:36.0600445Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:25:36.0600828Z 2025-05-07T20:25:36.0600834Z 2025-05-07T20:25:36.0600839Z 2025-05-07T20:25:36.0600845Z 2025-05-07T20:25:36.0600850Z 2025-05-07T20:25:36.1104530Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 58%  2025-05-07T20:25:36.1104910Z 2025-05-07T20:25:36.1104916Z 2025-05-07T20:25:36.1104921Z 2025-05-07T20:25:36.1104926Z 2025-05-07T20:25:36.1104931Z 2025-05-07T20:25:36.1106832Z 2025-05-07T20:25:36.1383567Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 49%  2025-05-07T20:25:36.1568633Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:25:36.1568895Z 2025-05-07T20:25:36.1568899Z 2025-05-07T20:25:36.1568903Z 2025-05-07T20:25:36.1568906Z 2025-05-07T20:25:36.1568922Z 2025-05-07T20:25:36.1568926Z 2025-05-07T20:25:36.1572756Z 2025-05-07T20:25:36.1716862Z libnpp-12.3.1.54 | 93.4 MB | 2 | 3%  2025-05-07T20:25:36.1717254Z 2025-05-07T20:25:36.1717258Z 2025-05-07T20:25:36.1717262Z 2025-05-07T20:25:36.1717266Z 2025-05-07T20:25:36.1717270Z 2025-05-07T20:25:36.2170655Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:25:36.2170999Z 2025-05-07T20:25:36.2171005Z 2025-05-07T20:25:36.2171010Z 2025-05-07T20:25:36.2171015Z 2025-05-07T20:25:36.2171021Z 2025-05-07T20:25:36.2174264Z 2025-05-07T20:25:36.2394755Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 52%  2025-05-07T20:25:36.2570847Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:25:36.2571237Z 2025-05-07T20:25:36.2571243Z 2025-05-07T20:25:36.2571248Z 2025-05-07T20:25:36.2571253Z 2025-05-07T20:25:36.2571259Z 2025-05-07T20:25:36.2571264Z 2025-05-07T20:25:36.2573601Z 2025-05-07T20:25:36.2795642Z libnpp-12.3.1.54 | 93.4 MB | 5 | 5%  2025-05-07T20:25:36.2795928Z 2025-05-07T20:25:36.2795932Z 2025-05-07T20:25:36.2795936Z 2025-05-07T20:25:36.2795940Z 2025-05-07T20:25:36.2795943Z 2025-05-07T20:25:36.3216007Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 64%  2025-05-07T20:25:36.3216288Z 2025-05-07T20:25:36.3216292Z 2025-05-07T20:25:36.3216296Z 2025-05-07T20:25:36.3216300Z 2025-05-07T20:25:36.3216303Z 2025-05-07T20:25:36.3216318Z 2025-05-07T20:25:36.3425712Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 54%  2025-05-07T20:25:36.3571525Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:25:36.3571777Z 2025-05-07T20:25:36.3571795Z 2025-05-07T20:25:36.3571799Z 2025-05-07T20:25:36.3571803Z 2025-05-07T20:25:36.3571807Z 2025-05-07T20:25:36.3571810Z 2025-05-07T20:25:36.3573731Z 2025-05-07T20:25:36.3942835Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:25:36.3943159Z 2025-05-07T20:25:36.3943165Z 2025-05-07T20:25:36.3943170Z 2025-05-07T20:25:36.3943175Z 2025-05-07T20:25:36.3943180Z 2025-05-07T20:25:36.4290289Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 67%  2025-05-07T20:25:36.4290686Z 2025-05-07T20:25:36.4290692Z 2025-05-07T20:25:36.4290698Z 2025-05-07T20:25:36.4290704Z 2025-05-07T20:25:36.4290709Z 2025-05-07T20:25:36.4292327Z 2025-05-07T20:25:36.4442106Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 57%  2025-05-07T20:25:36.4580170Z nsight-compute-2024. 
| 443.1 MB | ######6 | 66% 2025-05-07T20:25:36.4580443Z 2025-05-07T20:25:36.4580447Z 2025-05-07T20:25:36.4580451Z 2025-05-07T20:25:36.4580455Z 2025-05-07T20:25:36.4580459Z 2025-05-07T20:25:36.4580700Z 2025-05-07T20:25:36.4583967Z 2025-05-07T20:25:36.5079616Z libnpp-12.3.1.54 | 93.4 MB | # | 11%  2025-05-07T20:25:36.5080003Z 2025-05-07T20:25:36.5080007Z 2025-05-07T20:25:36.5080011Z 2025-05-07T20:25:36.5080234Z 2025-05-07T20:25:36.5080988Z 2025-05-07T20:25:36.5293058Z cuda-nvvp-12.6.80 | 109.3 MB | ######9 | 69%  2025-05-07T20:25:36.5293427Z 2025-05-07T20:25:36.5293431Z 2025-05-07T20:25:36.5293435Z 2025-05-07T20:25:36.5293438Z 2025-05-07T20:25:36.5293442Z 2025-05-07T20:25:36.5297129Z 2025-05-07T20:25:36.5557930Z libcusolver-11.7.1.2 | 95.8 MB | #####9 | 60%  2025-05-07T20:25:36.5586468Z nsight-compute-2024. | 443.1 MB | ######7 | 67% 2025-05-07T20:25:36.5586749Z 2025-05-07T20:25:36.5586753Z 2025-05-07T20:25:36.5586757Z 2025-05-07T20:25:36.5586761Z 2025-05-07T20:25:36.5586764Z 2025-05-07T20:25:36.5586768Z 2025-05-07T20:25:36.5588470Z 2025-05-07T20:25:36.6259973Z libnpp-12.3.1.54 | 93.4 MB | #3 | 14%  2025-05-07T20:25:36.6260277Z 2025-05-07T20:25:36.6260281Z 2025-05-07T20:25:36.6260284Z 2025-05-07T20:25:36.6260288Z 2025-05-07T20:25:36.6260292Z 2025-05-07T20:25:36.6294227Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:25:36.6294536Z 2025-05-07T20:25:36.6294540Z 2025-05-07T20:25:36.6294543Z 2025-05-07T20:25:36.6294547Z 2025-05-07T20:25:36.6294551Z 2025-05-07T20:25:36.6294555Z 2025-05-07T20:25:36.6588955Z libcusolver-11.7.1.2 | 95.8 MB | ######2 | 63%  2025-05-07T20:25:36.6589389Z 2025-05-07T20:25:36.6589395Z 2025-05-07T20:25:36.6589400Z 2025-05-07T20:25:36.6589405Z 2025-05-07T20:25:36.6589410Z 2025-05-07T20:25:36.6589416Z 2025-05-07T20:25:36.6591786Z 2025-05-07T20:25:36.6699482Z libnpp-12.3.1.54 | 93.4 MB | #6 | 16%  2025-05-07T20:25:36.7296252Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:25:36.7296592Z 2025-05-07T20:25:36.7296624Z 2025-05-07T20:25:36.7296630Z 2025-05-07T20:25:36.7296635Z 2025-05-07T20:25:36.7296639Z 2025-05-07T20:25:36.7300554Z 2025-05-07T20:25:36.7335089Z libcusolver-11.7.1.2 | 95.8 MB | ######5 | 66%  2025-05-07T20:25:36.7335549Z 2025-05-07T20:25:36.7335571Z 2025-05-07T20:25:36.7335577Z 2025-05-07T20:25:36.7335582Z 2025-05-07T20:25:36.7335588Z 2025-05-07T20:25:36.7591389Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:25:36.7591678Z 2025-05-07T20:25:36.7591682Z 2025-05-07T20:25:36.7591686Z 2025-05-07T20:25:36.7591689Z 2025-05-07T20:25:36.7591693Z 2025-05-07T20:25:36.7591697Z 2025-05-07T20:25:36.7593134Z 2025-05-07T20:25:36.7700063Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:25:36.8333545Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:25:36.8333825Z 2025-05-07T20:25:36.8333829Z 2025-05-07T20:25:36.8333833Z 2025-05-07T20:25:36.8333837Z 2025-05-07T20:25:36.8333870Z 2025-05-07T20:25:36.8333874Z 2025-05-07T20:25:36.8405108Z libcusolver-11.7.1.2 | 95.8 MB | ######8 | 68%  2025-05-07T20:25:36.8405821Z 2025-05-07T20:25:36.8405825Z 2025-05-07T20:25:36.8405829Z 2025-05-07T20:25:36.8405833Z 2025-05-07T20:25:36.8405855Z 2025-05-07T20:25:36.8622347Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 77%  2025-05-07T20:25:36.8622633Z 2025-05-07T20:25:36.8622637Z 2025-05-07T20:25:36.8622640Z 2025-05-07T20:25:36.8622644Z 2025-05-07T20:25:36.8622648Z 2025-05-07T20:25:36.8622652Z 2025-05-07T20:25:36.8622655Z 2025-05-07T20:25:36.8720070Z libnpp-12.3.1.54 | 93.4 MB | ##1 | 22%  2025-05-07T20:25:36.9354832Z nsight-compute-2024. 
| 443.1 MB | ######8 | 69% 2025-05-07T20:25:36.9355098Z 2025-05-07T20:25:36.9355103Z 2025-05-07T20:25:36.9355107Z 2025-05-07T20:25:36.9355110Z 2025-05-07T20:25:36.9355114Z 2025-05-07T20:25:36.9356819Z 2025-05-07T20:25:36.9493163Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 71%  2025-05-07T20:25:36.9493471Z 2025-05-07T20:25:36.9493475Z 2025-05-07T20:25:36.9493478Z 2025-05-07T20:25:36.9493482Z 2025-05-07T20:25:36.9495415Z 2025-05-07T20:25:36.9639813Z cuda-nvvp-12.6.80 | 109.3 MB | #######9 | 79%  2025-05-07T20:25:36.9640419Z 2025-05-07T20:25:36.9640423Z 2025-05-07T20:25:36.9640427Z 2025-05-07T20:25:36.9640431Z 2025-05-07T20:25:36.9640435Z 2025-05-07T20:25:36.9640446Z 2025-05-07T20:25:36.9643492Z 2025-05-07T20:25:36.9723995Z libnpp-12.3.1.54 | 93.4 MB | ##4 | 24%  2025-05-07T20:25:37.0360100Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:25:37.0360653Z 2025-05-07T20:25:37.0360658Z 2025-05-07T20:25:37.0360661Z 2025-05-07T20:25:37.0360666Z 2025-05-07T20:25:37.0360669Z 2025-05-07T20:25:37.0361620Z 2025-05-07T20:25:37.0505393Z libcusolver-11.7.1.2 | 95.8 MB | #######3 | 74%  2025-05-07T20:25:37.0505805Z 2025-05-07T20:25:37.0505834Z 2025-05-07T20:25:37.0505840Z 2025-05-07T20:25:37.0505845Z 2025-05-07T20:25:37.0507795Z 2025-05-07T20:25:37.0641425Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 82%  2025-05-07T20:25:37.0641735Z 2025-05-07T20:25:37.0641741Z 2025-05-07T20:25:37.0641764Z 2025-05-07T20:25:37.0641769Z 2025-05-07T20:25:37.0641774Z 2025-05-07T20:25:37.0641787Z 2025-05-07T20:25:37.0641794Z 2025-05-07T20:25:37.0808818Z libnpp-12.3.1.54 | 93.4 MB | ##7 | 27%  2025-05-07T20:25:37.1361897Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:25:37.1362173Z 2025-05-07T20:25:37.1362177Z 2025-05-07T20:25:37.1362180Z 2025-05-07T20:25:37.1362185Z 2025-05-07T20:25:37.1362188Z 2025-05-07T20:25:37.1363820Z 2025-05-07T20:25:37.1582260Z libcusolver-11.7.1.2 | 95.8 MB | #######6 | 77%  2025-05-07T20:25:37.1582615Z 2025-05-07T20:25:37.1582619Z 2025-05-07T20:25:37.1582623Z 2025-05-07T20:25:37.1582627Z 2025-05-07T20:25:37.1587306Z 2025-05-07T20:25:37.1642477Z cuda-nvvp-12.6.80 | 109.3 MB | ########3 | 84%  2025-05-07T20:25:37.1642872Z 2025-05-07T20:25:37.1642878Z 2025-05-07T20:25:37.1642883Z 2025-05-07T20:25:37.1642888Z 2025-05-07T20:25:37.1642893Z 2025-05-07T20:25:37.1642923Z 2025-05-07T20:25:37.1645922Z 2025-05-07T20:25:37.1835061Z libnpp-12.3.1.54 | 93.4 MB | ##9 | 30%  2025-05-07T20:25:37.2363028Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:25:37.2363301Z 2025-05-07T20:25:37.2363305Z 2025-05-07T20:25:37.2363309Z 2025-05-07T20:25:37.2363313Z 2025-05-07T20:25:37.2363317Z 2025-05-07T20:25:37.2365278Z 2025-05-07T20:25:37.2645593Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 79%  2025-05-07T20:25:37.2645894Z 2025-05-07T20:25:37.2645898Z 2025-05-07T20:25:37.2645902Z 2025-05-07T20:25:37.2645906Z 2025-05-07T20:25:37.2645909Z 2025-05-07T20:25:37.2706220Z cuda-nvvp-12.6.80 | 109.3 MB | ########6 | 86%  2025-05-07T20:25:37.2706527Z 2025-05-07T20:25:37.2706531Z 2025-05-07T20:25:37.2706535Z 2025-05-07T20:25:37.2706539Z 2025-05-07T20:25:37.2706543Z 2025-05-07T20:25:37.2706546Z 2025-05-07T20:25:37.2706558Z 2025-05-07T20:25:37.2837060Z libnpp-12.3.1.54 | 93.4 MB | ###2 | 33%  2025-05-07T20:25:37.3363236Z nsight-compute-2024. 
| 443.1 MB | #######1 | 71% 2025-05-07T20:25:37.3363594Z 2025-05-07T20:25:37.3363599Z 2025-05-07T20:25:37.3363604Z 2025-05-07T20:25:37.3363609Z 2025-05-07T20:25:37.3363614Z 2025-05-07T20:25:37.3365467Z 2025-05-07T20:25:37.3659851Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:25:37.3660156Z 2025-05-07T20:25:37.3660160Z 2025-05-07T20:25:37.3660164Z 2025-05-07T20:25:37.3660167Z 2025-05-07T20:25:37.3661439Z 2025-05-07T20:25:37.3837200Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 89%  2025-05-07T20:25:37.3837529Z 2025-05-07T20:25:37.3837534Z 2025-05-07T20:25:37.3837537Z 2025-05-07T20:25:37.3837830Z 2025-05-07T20:25:37.3837838Z 2025-05-07T20:25:37.3837844Z 2025-05-07T20:25:37.3839241Z 2025-05-07T20:25:37.3852993Z libnpp-12.3.1.54 | 93.4 MB | ###5 | 35%  2025-05-07T20:25:37.4364274Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:25:37.4364774Z 2025-05-07T20:25:37.4364778Z 2025-05-07T20:25:37.4364781Z 2025-05-07T20:25:37.4364785Z 2025-05-07T20:25:37.4364789Z 2025-05-07T20:25:37.4366212Z 2025-05-07T20:25:37.4760459Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 86%  2025-05-07T20:25:37.4760763Z 2025-05-07T20:25:37.4760767Z 2025-05-07T20:25:37.4760771Z 2025-05-07T20:25:37.4760774Z 2025-05-07T20:25:37.4762394Z 2025-05-07T20:25:37.4844008Z cuda-nvvp-12.6.80 | 109.3 MB | ######### | 91%  2025-05-07T20:25:37.4844293Z 2025-05-07T20:25:37.4844297Z 2025-05-07T20:25:37.4844301Z 2025-05-07T20:25:37.4844305Z 2025-05-07T20:25:37.4844309Z 2025-05-07T20:25:37.4844313Z 2025-05-07T20:25:37.4844333Z 2025-05-07T20:25:37.4861121Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 38%  2025-05-07T20:25:37.5428319Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:25:37.5428688Z 2025-05-07T20:25:37.5428694Z 2025-05-07T20:25:37.5428725Z 2025-05-07T20:25:37.5428730Z 2025-05-07T20:25:37.5428735Z 2025-05-07T20:25:37.5430596Z 2025-05-07T20:25:37.5802676Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 89%  2025-05-07T20:25:37.5802972Z 2025-05-07T20:25:37.5802976Z 2025-05-07T20:25:37.5802980Z 2025-05-07T20:25:37.5802984Z 2025-05-07T20:25:37.5803035Z 2025-05-07T20:25:37.5864999Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 93%  2025-05-07T20:25:37.5959891Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:25:37.5960149Z 2025-05-07T20:25:37.5960153Z 2025-05-07T20:25:37.5960157Z 2025-05-07T20:25:37.5960161Z 2025-05-07T20:25:37.5960164Z 2025-05-07T20:25:37.5960168Z 2025-05-07T20:25:37.5963090Z 2025-05-07T20:25:37.6429985Z libnpp-12.3.1.54 | 93.4 MB | #### | 41%  2025-05-07T20:25:37.6430318Z 2025-05-07T20:25:37.6430324Z 2025-05-07T20:25:37.6430330Z 2025-05-07T20:25:37.6430335Z 2025-05-07T20:25:37.6430340Z 2025-05-07T20:25:37.6430371Z 2025-05-07T20:25:37.6813490Z libcusolver-11.7.1.2 | 95.8 MB | #########1 | 92%  2025-05-07T20:25:37.6813799Z 2025-05-07T20:25:37.6813803Z 2025-05-07T20:25:37.6813806Z 2025-05-07T20:25:37.6813810Z 2025-05-07T20:25:37.6813813Z 2025-05-07T20:25:37.6904740Z cuda-nvvp-12.6.80 | 109.3 MB | #########5 | 95%  2025-05-07T20:25:37.6960766Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:25:37.6961030Z 2025-05-07T20:25:37.6961037Z 2025-05-07T20:25:37.6961041Z 2025-05-07T20:25:37.6961044Z 2025-05-07T20:25:37.6961048Z 2025-05-07T20:25:37.6961052Z 2025-05-07T20:25:37.6964183Z 2025-05-07T20:25:37.7466853Z libnpp-12.3.1.54 | 93.4 MB | ####3 | 44%  2025-05-07T20:25:37.7467178Z 2025-05-07T20:25:37.7467183Z 2025-05-07T20:25:37.7467187Z 2025-05-07T20:25:37.7467190Z 2025-05-07T20:25:37.7467194Z 2025-05-07T20:25:37.7468097Z 2025-05-07T20:25:37.7814393Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 95%  2025-05-07T20:25:37.7814762Z 2025-05-07T20:25:37.7814768Z 2025-05-07T20:25:37.7814773Z 2025-05-07T20:25:37.7814779Z 2025-05-07T20:25:37.7814784Z 2025-05-07T20:25:37.7931821Z cuda-nvvp-12.6.80 | 109.3 MB | #########7 | 98%  2025-05-07T20:25:37.8086239Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:25:37.8086500Z 2025-05-07T20:25:37.8086504Z 2025-05-07T20:25:37.8086508Z 2025-05-07T20:25:37.8088572Z 2025-05-07T20:25:37.8098866Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:25:37.8099168Z 2025-05-07T20:25:37.8099172Z 2025-05-07T20:25:37.8099176Z 2025-05-07T20:25:37.8099180Z 2025-05-07T20:25:37.8099184Z 2025-05-07T20:25:37.8099187Z 2025-05-07T20:25:37.8110717Z 2025-05-07T20:25:37.8616469Z libnpp-12.3.1.54 | 93.4 MB | ####6 | 46%  2025-05-07T20:25:37.8616768Z 2025-05-07T20:25:37.8616774Z 2025-05-07T20:25:37.8616778Z 2025-05-07T20:25:37.8616782Z 2025-05-07T20:25:37.8616785Z 2025-05-07T20:25:37.8619075Z 2025-05-07T20:25:37.8955840Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 98%  2025-05-07T20:25:37.9101748Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:25:37.9102008Z 2025-05-07T20:25:37.9102260Z 2025-05-07T20:25:37.9102265Z 2025-05-07T20:25:37.9102269Z 2025-05-07T20:25:37.9102272Z 2025-05-07T20:25:37.9102276Z 2025-05-07T20:25:37.9104875Z 2025-05-07T20:25:37.9957864Z libnpp-12.3.1.54 | 93.4 MB | ####9 | 49%  2025-05-07T20:25:38.0105098Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:25:38.0105360Z 2025-05-07T20:25:38.0105364Z 2025-05-07T20:25:38.0105367Z 2025-05-07T20:25:38.0105371Z 2025-05-07T20:25:38.0105399Z 2025-05-07T20:25:38.0105403Z 2025-05-07T20:25:38.0107172Z 2025-05-07T20:25:38.0958869Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 53%  2025-05-07T20:25:38.1108501Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:25:38.1108877Z 2025-05-07T20:25:38.1108881Z 2025-05-07T20:25:38.1108885Z 2025-05-07T20:25:38.1108888Z 2025-05-07T20:25:38.1108892Z 2025-05-07T20:25:38.1108896Z 2025-05-07T20:25:38.1108900Z 2025-05-07T20:25:38.1961326Z libnpp-12.3.1.54 | 93.4 MB | #####6 | 56%  2025-05-07T20:25:38.2110186Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:25:38.2110561Z 2025-05-07T20:25:38.2110567Z 2025-05-07T20:25:38.2110573Z 2025-05-07T20:25:38.2110578Z 2025-05-07T20:25:38.2110583Z 2025-05-07T20:25:38.2110589Z 2025-05-07T20:25:38.2110595Z 2025-05-07T20:25:38.2965792Z libnpp-12.3.1.54 | 93.4 MB | ###### | 60%  2025-05-07T20:25:38.3126339Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:25:38.3126709Z 2025-05-07T20:25:38.3126715Z 2025-05-07T20:25:38.3126720Z 2025-05-07T20:25:38.3126726Z 2025-05-07T20:25:38.3126731Z 2025-05-07T20:25:38.3126737Z 2025-05-07T20:25:38.3126742Z 2025-05-07T20:25:38.3969893Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 64%  2025-05-07T20:25:38.4127646Z nsight-compute-2024. 
| 443.1 MB | #######8 | 79% 2025-05-07T20:25:38.4128053Z 2025-05-07T20:25:38.4128059Z 2025-05-07T20:25:38.4128064Z 2025-05-07T20:25:38.4128069Z 2025-05-07T20:25:38.4128075Z 2025-05-07T20:25:38.4128080Z 2025-05-07T20:25:38.4128085Z 2025-05-07T20:25:38.4971317Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 68%  2025-05-07T20:25:38.5128408Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:25:38.5128793Z 2025-05-07T20:25:38.5128799Z 2025-05-07T20:25:38.5128804Z 2025-05-07T20:25:38.5128810Z 2025-05-07T20:25:38.5128815Z 2025-05-07T20:25:38.5128821Z 2025-05-07T20:25:38.5128858Z 2025-05-07T20:25:38.5973182Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:25:38.6133224Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:25:38.6133499Z 2025-05-07T20:25:38.6133503Z 2025-05-07T20:25:38.6133535Z 2025-05-07T20:25:38.6133539Z 2025-05-07T20:25:38.6133542Z 2025-05-07T20:25:38.6133546Z 2025-05-07T20:25:38.6133550Z 2025-05-07T20:25:38.7016373Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 76%  2025-05-07T20:25:38.7136758Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:25:38.7137055Z 2025-05-07T20:25:38.7137061Z 2025-05-07T20:25:38.7137066Z 2025-05-07T20:25:38.7137071Z 2025-05-07T20:25:38.7137077Z 2025-05-07T20:25:38.7137082Z 2025-05-07T20:25:38.7137087Z 2025-05-07T20:25:38.8116844Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:25:38.8197322Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:25:38.8197642Z 2025-05-07T20:25:38.8197940Z 2025-05-07T20:25:38.8197949Z 2025-05-07T20:25:38.8197964Z 2025-05-07T20:25:38.8197969Z 2025-05-07T20:25:38.8197974Z 2025-05-07T20:25:38.8200808Z 2025-05-07T20:25:38.9119296Z libnpp-12.3.1.54 | 93.4 MB | ########4 | 84%  2025-05-07T20:25:38.9216649Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:25:38.9216996Z 2025-05-07T20:25:38.9217011Z 2025-05-07T20:25:38.9217016Z 2025-05-07T20:25:38.9217022Z 2025-05-07T20:25:38.9217027Z 2025-05-07T20:25:38.9217032Z 2025-05-07T20:25:38.9222979Z 2025-05-07T20:25:39.0212012Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:25:39.0219094Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:25:39.0219434Z 2025-05-07T20:25:39.0219440Z 2025-05-07T20:25:39.0219444Z 2025-05-07T20:25:39.0219449Z 2025-05-07T20:25:39.0219468Z 2025-05-07T20:25:39.0219473Z 2025-05-07T20:25:39.0220745Z 2025-05-07T20:25:39.1213776Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:25:39.1268126Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:25:39.1268484Z 2025-05-07T20:25:39.1268490Z 2025-05-07T20:25:39.1268496Z 2025-05-07T20:25:39.1268501Z 2025-05-07T20:25:39.1268526Z 2025-05-07T20:25:39.1268531Z 2025-05-07T20:25:39.1268797Z 2025-05-07T20:25:39.2274406Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 96%  2025-05-07T20:25:39.2274718Z 2025-05-07T20:25:39.2274724Z 2025-05-07T20:25:39.2274730Z 2025-05-07T20:25:39.2274735Z 2025-05-07T20:25:39.2274741Z 2025-05-07T20:25:39.2274746Z 2025-05-07T20:25:39.2274760Z 2025-05-07T20:25:39.2307382Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 100%  2025-05-07T20:25:39.3313575Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:25:39.4315073Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:25:39.5316330Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:25:39.6316884Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:25:39.7361964Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:25:39.8365207Z nsight-compute-2024. 
| 443.1 MB | ########8 | 89% 2025-05-07T20:25:39.9371061Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:25:40.0373760Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:25:40.1375289Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:25:40.3631283Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:25:40.4632121Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:25:40.5635333Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:25:40.6417460Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:25:40.6417742Z 2025-05-07T20:25:40.6417746Z 2025-05-07T20:25:40.6417749Z 2025-05-07T20:25:40.6417753Z 2025-05-07T20:25:40.6417757Z 2025-05-07T20:25:40.6419772Z 2025-05-07T20:25:40.6635817Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:25:40.6931188Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:25:40.6931546Z 2025-05-07T20:25:40.6931552Z 2025-05-07T20:25:40.6931557Z 2025-05-07T20:25:40.6931562Z 2025-05-07T20:25:40.6931587Z 2025-05-07T20:25:40.6931592Z 2025-05-07T20:25:40.6931598Z 2025-05-07T20:25:40.6931603Z 2025-05-07T20:25:40.7801467Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:25:40.7934514Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:25:40.7934881Z 2025-05-07T20:25:40.7934887Z 2025-05-07T20:25:40.7934892Z 2025-05-07T20:25:40.7934897Z 2025-05-07T20:25:40.7934902Z 2025-05-07T20:25:40.7934908Z 2025-05-07T20:25:40.7934913Z 2025-05-07T20:25:40.7936515Z 2025-05-07T20:25:40.8955701Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:25:40.9087599Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:25:40.9088115Z 2025-05-07T20:25:40.9088121Z 2025-05-07T20:25:40.9088124Z 2025-05-07T20:25:40.9088128Z 2025-05-07T20:25:40.9088132Z 2025-05-07T20:25:40.9088136Z 2025-05-07T20:25:40.9088140Z 2025-05-07T20:25:40.9089743Z 2025-05-07T20:25:41.0111809Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 14%  2025-05-07T20:25:41.0248847Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:25:41.0249109Z 2025-05-07T20:25:41.0249113Z 2025-05-07T20:25:41.0249117Z 2025-05-07T20:25:41.0249121Z 2025-05-07T20:25:41.0249125Z 2025-05-07T20:25:41.0249129Z 2025-05-07T20:25:41.0249133Z 2025-05-07T20:25:41.0251963Z 2025-05-07T20:25:41.0955353Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 20%  2025-05-07T20:25:41.0955663Z 2025-05-07T20:25:41.0955667Z 2025-05-07T20:25:41.0955671Z 2025-05-07T20:25:41.0955674Z 2025-05-07T20:25:41.0955678Z 2025-05-07T20:25:41.1164757Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:25:41.1252367Z nsight-compute-2024. 
| 443.1 MB | ########## | 100%
2025-05-07T20:25:42.1222852Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:25:42.9254209Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:25:43.8027975Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:25:43.9567798Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:25:44.0032229Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:25:44.2321218Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:25:44.6881721Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:25:45.2821333Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:25:45.2883599Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:25:45.5323495Z python-3.9.18 | 22.7 MB | ########## | 100%
2025-05-07T20:25:45.6323738Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:25:46.0212127Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:25:46.0801739Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:25:46.1801855Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:25:46.3426037Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:25:48.3027396Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:25:48.4399184Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:25:49.2906721Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:25:52.5289226Z ... (more hidden) ...
2025-05-07T20:25:57.1766893Z 2025-05-07T20:25:57.1766897Z 2025-05-07T20:25:57.1766901Z 2025-05-07T20:25:57.1766904Z 2025-05-07T20:25:57.1766908Z 2025-05-07T20:25:57.1767097Z  2025-05-07T20:25:57.1767335Z 2025-05-07T20:25:57.1767338Z 2025-05-07T20:25:57.1767342Z 2025-05-07T20:25:57.1767346Z 2025-05-07T20:25:57.1767349Z 2025-05-07T20:25:57.1767353Z 2025-05-07T20:25:57.1767357Z 2025-05-07T20:25:57.1767367Z 2025-05-07T20:25:57.1767371Z 2025-05-07T20:25:57.1767374Z 2025-05-07T20:25:57.1767378Z 2025-05-07T20:25:57.1767381Z 2025-05-07T20:25:57.1767385Z 2025-05-07T20:25:57.1767448Z 2025-05-07T20:25:57.1767451Z 2025-05-07T20:25:57.1767455Z 2025-05-07T20:25:57.1767459Z 2025-05-07T20:25:57.1767617Z  2025-05-07T20:25:57.1767823Z 2025-05-07T20:25:57.1767826Z 2025-05-07T20:25:57.1767833Z 2025-05-07T20:25:57.1767838Z 2025-05-07T20:25:57.1767844Z 2025-05-07T20:25:57.1767855Z 2025-05-07T20:25:57.1767861Z 2025-05-07T20:25:57.1767866Z 2025-05-07T20:25:57.1767871Z 2025-05-07T20:25:57.1767876Z 2025-05-07T20:25:57.1767881Z 2025-05-07T20:25:57.1767886Z 2025-05-07T20:25:57.1767891Z 2025-05-07T20:25:57.1767896Z 2025-05-07T20:25:57.1767901Z 2025-05-07T20:25:57.1767912Z 2025-05-07T20:25:57.1767917Z 2025-05-07T20:25:57.1767922Z 2025-05-07T20:25:57.1768134Z  2025-05-07T20:25:57.1768343Z 2025-05-07T20:25:57.1768347Z 2025-05-07T20:25:57.1768450Z  2025-05-07T20:25:57.1768591Z 2025-05-07T20:25:57.1768597Z 2025-05-07T20:25:57.1768738Z  2025-05-07T20:25:57.1768886Z 2025-05-07T20:25:57.1768891Z 2025-05-07T20:25:57.1768897Z 2025-05-07T20:25:57.1769042Z  2025-05-07T20:25:57.1769186Z 2025-05-07T20:25:57.1769192Z 2025-05-07T20:25:57.1769205Z 2025-05-07T20:25:57.1769210Z 2025-05-07T20:25:57.1769351Z  2025-05-07T20:25:57.1769510Z 2025-05-07T20:25:57.1769516Z 2025-05-07T20:25:57.1769521Z 2025-05-07T20:25:57.1769635Z 2025-05-07T20:25:57.1769641Z 2025-05-07T20:25:57.1769802Z  2025-05-07T20:25:57.1769983Z 2025-05-07T20:25:57.1769990Z 2025-05-07T20:25:57.1769996Z 2025-05-07T20:25:57.1770003Z 2025-05-07T20:25:57.1770010Z 2025-05-07T20:25:57.1770106Z 2025-05-07T20:25:57.1770276Z  2025-05-07T20:25:57.1770403Z 2025-05-07T20:25:57.1770406Z 2025-05-07T20:25:57.1770410Z 2025-05-07T20:25:57.1770414Z 2025-05-07T20:25:57.1770417Z 2025-05-07T20:25:57.1770421Z 2025-05-07T20:25:57.1770425Z 2025-05-07T20:25:57.1770548Z  2025-05-07T20:25:57.1770684Z 2025-05-07T20:25:57.1770687Z 2025-05-07T20:25:57.1770691Z 2025-05-07T20:25:57.1770695Z 2025-05-07T20:25:57.1770698Z 2025-05-07T20:25:57.1770702Z 2025-05-07T20:25:57.1770706Z 2025-05-07T20:25:57.1770709Z 2025-05-07T20:25:57.1770833Z  2025-05-07T20:25:57.1770981Z 2025-05-07T20:25:57.1770984Z 2025-05-07T20:25:57.1770988Z 2025-05-07T20:25:57.1770992Z 2025-05-07T20:25:57.1771001Z 2025-05-07T20:25:57.1771005Z 2025-05-07T20:25:57.1771009Z 2025-05-07T20:25:57.1771012Z 2025-05-07T20:25:57.1771016Z 2025-05-07T20:25:57.1771139Z  2025-05-07T20:25:57.1771291Z 2025-05-07T20:25:57.1771295Z 2025-05-07T20:25:57.1771298Z 2025-05-07T20:25:57.1771307Z 2025-05-07T20:25:57.1771311Z 2025-05-07T20:25:57.1771315Z 2025-05-07T20:25:57.1771318Z 2025-05-07T20:25:57.1771322Z 2025-05-07T20:25:57.1771325Z 2025-05-07T20:25:57.1771336Z 2025-05-07T20:25:57.1771459Z  2025-05-07T20:25:57.1771621Z 2025-05-07T20:25:57.1771624Z 2025-05-07T20:25:57.1771628Z 2025-05-07T20:25:57.1771631Z 2025-05-07T20:25:57.1771635Z 2025-05-07T20:25:57.1771639Z 2025-05-07T20:25:57.1771642Z 2025-05-07T20:25:57.1771653Z 2025-05-07T20:25:57.1771657Z 2025-05-07T20:25:57.1771660Z 2025-05-07T20:25:57.1771664Z 2025-05-07T20:25:57.1771793Z  2025-05-07T20:25:57.1771964Z 
2025-05-07T20:25:57.1771968Z 2025-05-07T20:25:57.1771975Z 2025-05-07T20:25:57.1771985Z 2025-05-07T20:25:57.1771989Z 2025-05-07T20:25:57.1771993Z 2025-05-07T20:25:57.1771996Z 2025-05-07T20:25:57.1772000Z 2025-05-07T20:25:57.1772004Z 2025-05-07T20:25:57.1772007Z 2025-05-07T20:25:57.1772011Z 2025-05-07T20:25:57.1772015Z 2025-05-07T20:25:57.1772150Z  2025-05-07T20:25:57.1772333Z 2025-05-07T20:25:57.1772337Z 2025-05-07T20:25:57.1772340Z 2025-05-07T20:25:57.1772344Z 2025-05-07T20:25:57.1772348Z 2025-05-07T20:25:57.1772351Z 2025-05-07T20:25:57.1772355Z 2025-05-07T20:25:57.1772358Z 2025-05-07T20:25:57.1772362Z 2025-05-07T20:25:57.1772366Z 2025-05-07T20:25:57.1772369Z 2025-05-07T20:25:57.1772373Z 2025-05-07T20:25:57.1772377Z 2025-05-07T20:25:57.1772506Z  2025-05-07T20:25:57.1772695Z 2025-05-07T20:25:57.1772699Z 2025-05-07T20:25:57.1772703Z 2025-05-07T20:25:57.1772706Z 2025-05-07T20:25:57.1772710Z 2025-05-07T20:25:57.1772714Z 2025-05-07T20:25:57.1772717Z 2025-05-07T20:25:57.1772725Z 2025-05-07T20:25:57.1772728Z 2025-05-07T20:25:57.1772732Z 2025-05-07T20:25:57.1772736Z 2025-05-07T20:25:57.1772739Z 2025-05-07T20:25:57.1772743Z 2025-05-07T20:25:57.1772746Z 2025-05-07T20:25:57.1772889Z  2025-05-07T20:25:57.1773150Z 2025-05-07T20:25:57.1773155Z 2025-05-07T20:25:57.1773161Z 2025-05-07T20:25:57.1773166Z 2025-05-07T20:25:57.1773171Z 2025-05-07T20:25:57.1773176Z 2025-05-07T20:25:57.1773182Z 2025-05-07T20:25:57.1773187Z 2025-05-07T20:25:57.1773192Z 2025-05-07T20:25:57.1773198Z 2025-05-07T20:25:57.1773211Z 2025-05-07T20:25:57.1773216Z 2025-05-07T20:25:57.1773222Z 2025-05-07T20:25:57.1773227Z 2025-05-07T20:25:57.1773231Z 2025-05-07T20:25:57.1773442Z  2025-05-07T20:25:57.1773712Z 2025-05-07T20:25:57.1773717Z 2025-05-07T20:25:57.1773730Z 2025-05-07T20:25:57.1773735Z 2025-05-07T20:25:57.1773740Z 2025-05-07T20:25:57.1773745Z 2025-05-07T20:25:57.1773751Z 2025-05-07T20:25:57.1773932Z 2025-05-07T20:25:57.1773938Z 2025-05-07T20:25:57.1773943Z 2025-05-07T20:25:57.1773949Z 2025-05-07T20:25:57.1773954Z 2025-05-07T20:25:57.1773959Z 2025-05-07T20:25:57.1773964Z 2025-05-07T20:25:57.1773969Z 2025-05-07T20:25:57.1773975Z 2025-05-07T20:25:57.1774280Z  2025-05-07T20:25:57.1774488Z 2025-05-07T20:25:57.1774492Z 2025-05-07T20:25:57.1774496Z 2025-05-07T20:25:57.1774499Z 2025-05-07T20:25:57.1774503Z 2025-05-07T20:25:57.1774507Z 2025-05-07T20:25:57.1774510Z 2025-05-07T20:25:57.1774514Z 2025-05-07T20:25:57.1774518Z 2025-05-07T20:25:57.1774521Z 2025-05-07T20:25:57.1774525Z 2025-05-07T20:25:57.1774529Z 2025-05-07T20:25:57.1774532Z 2025-05-07T20:25:57.1774536Z 2025-05-07T20:25:57.1774540Z 2025-05-07T20:25:57.1774543Z 2025-05-07T20:25:57.1774547Z 2025-05-07T20:25:57.1774705Z  2025-05-07T20:25:57.1774906Z 2025-05-07T20:25:57.1774909Z 2025-05-07T20:25:57.1774913Z 2025-05-07T20:25:57.1774923Z 2025-05-07T20:25:57.1774926Z 2025-05-07T20:25:57.1774930Z 2025-05-07T20:25:57.1774934Z 2025-05-07T20:25:57.1774947Z 2025-05-07T20:25:57.1774953Z 2025-05-07T20:25:57.1774958Z 2025-05-07T20:25:57.1774962Z 2025-05-07T20:25:57.1774965Z 2025-05-07T20:25:57.1774975Z 2025-05-07T20:25:57.1774978Z 2025-05-07T20:25:57.1774982Z 2025-05-07T20:25:57.1774986Z 2025-05-07T20:25:57.1774989Z 2025-05-07T20:25:57.1774993Z 2025-05-07T20:25:57.1775155Z  2025-05-07T20:25:57.1775365Z 2025-05-07T20:25:57.1775368Z 2025-05-07T20:25:57.1775467Z  2025-05-07T20:25:57.1775569Z 2025-05-07T20:25:57.1775572Z 2025-05-07T20:25:57.1775678Z  2025-05-07T20:25:57.1775781Z 2025-05-07T20:25:57.1775784Z 2025-05-07T20:25:57.1775788Z 2025-05-07T20:25:57.1775897Z  2025-05-07T20:25:57.1776001Z 
2025-05-07T20:25:57.1776005Z 2025-05-07T20:25:57.1776009Z 2025-05-07T20:25:57.1776012Z 2025-05-07T20:25:57.1776113Z  2025-05-07T20:25:57.1776240Z 2025-05-07T20:25:57.1776243Z 2025-05-07T20:25:57.1776247Z 2025-05-07T20:25:57.1776250Z 2025-05-07T20:25:57.1776254Z 2025-05-07T20:25:57.1776361Z  2025-05-07T20:25:57.1776487Z 2025-05-07T20:25:57.1776491Z 2025-05-07T20:25:57.1776494Z 2025-05-07T20:25:57.1776502Z 2025-05-07T20:25:57.1776506Z 2025-05-07T20:25:57.1776509Z 2025-05-07T20:25:57.1776616Z  2025-05-07T20:25:57.1776745Z 2025-05-07T20:25:57.1776749Z 2025-05-07T20:25:57.1776753Z 2025-05-07T20:25:57.1776756Z 2025-05-07T20:25:57.1776760Z 2025-05-07T20:25:57.1776764Z 2025-05-07T20:25:57.1776768Z 2025-05-07T20:25:57.1776878Z  2025-05-07T20:25:57.1777018Z 2025-05-07T20:25:57.1777021Z 2025-05-07T20:25:57.1777025Z 2025-05-07T20:25:57.1777029Z 2025-05-07T20:25:57.1777032Z 2025-05-07T20:25:57.1777036Z 2025-05-07T20:25:57.1777040Z 2025-05-07T20:25:57.1777043Z 2025-05-07T20:25:57.1777158Z  2025-05-07T20:25:57.1777316Z 2025-05-07T20:25:57.1777320Z 2025-05-07T20:25:57.1777328Z 2025-05-07T20:25:57.1777332Z 2025-05-07T20:25:57.1777335Z 2025-05-07T20:25:57.1777339Z 2025-05-07T20:25:57.1777343Z 2025-05-07T20:25:57.1777347Z 2025-05-07T20:25:57.1777350Z 2025-05-07T20:25:57.1777471Z  2025-05-07T20:25:57.1777635Z 2025-05-07T20:25:57.1777638Z 2025-05-07T20:25:57.1777642Z 2025-05-07T20:25:57.1777646Z 2025-05-07T20:25:57.1777649Z 2025-05-07T20:25:57.1777653Z 2025-05-07T20:25:57.1777657Z 2025-05-07T20:25:57.1777660Z 2025-05-07T20:25:57.1777664Z 2025-05-07T20:25:57.1777668Z 2025-05-07T20:25:57.1777789Z  2025-05-07T20:25:57.1777953Z 2025-05-07T20:25:57.1777957Z 2025-05-07T20:25:57.1777960Z 2025-05-07T20:25:57.1777964Z 2025-05-07T20:25:57.1777968Z 2025-05-07T20:25:57.1777971Z 2025-05-07T20:25:57.1777975Z 2025-05-07T20:25:57.1777979Z 2025-05-07T20:25:57.1777982Z 2025-05-07T20:25:57.1777986Z 2025-05-07T20:25:57.1777990Z 2025-05-07T20:25:57.1778121Z  2025-05-07T20:25:57.1778381Z 2025-05-07T20:25:57.1778386Z 2025-05-07T20:25:57.1778389Z 2025-05-07T20:25:57.1778393Z 2025-05-07T20:25:57.1778396Z 2025-05-07T20:25:57.1778400Z 2025-05-07T20:25:57.1778404Z 2025-05-07T20:25:57.1778407Z 2025-05-07T20:25:57.1778411Z 2025-05-07T20:25:57.1778484Z 2025-05-07T20:25:57.1778487Z 2025-05-07T20:25:57.1778491Z 2025-05-07T20:25:57.1778634Z  2025-05-07T20:25:57.1778813Z 2025-05-07T20:25:57.1778817Z 2025-05-07T20:25:57.1778821Z 2025-05-07T20:25:57.1778824Z 2025-05-07T20:25:57.1778828Z 2025-05-07T20:25:57.1778832Z 2025-05-07T20:25:57.1778835Z 2025-05-07T20:25:57.1778839Z 2025-05-07T20:25:57.1778843Z 2025-05-07T20:25:57.1778847Z 2025-05-07T20:25:57.1778856Z 2025-05-07T20:25:57.1778859Z 2025-05-07T20:25:57.1778863Z 2025-05-07T20:25:57.1778994Z  2025-05-07T20:25:57.1779178Z 2025-05-07T20:25:57.1779181Z 2025-05-07T20:25:57.1779185Z 2025-05-07T20:25:57.1779189Z 2025-05-07T20:25:57.1779198Z 2025-05-07T20:25:57.1779207Z 2025-05-07T20:25:57.1779211Z 2025-05-07T20:25:57.1779215Z 2025-05-07T20:25:57.1779218Z 2025-05-07T20:25:57.1779222Z 2025-05-07T20:25:57.1779226Z 2025-05-07T20:25:57.1779230Z 2025-05-07T20:25:57.1779233Z 2025-05-07T20:25:57.1779237Z 2025-05-07T20:25:57.1779380Z  2025-05-07T20:25:57.1779574Z 2025-05-07T20:25:57.1779577Z 2025-05-07T20:25:57.1779581Z 2025-05-07T20:25:57.1779585Z 2025-05-07T20:25:57.1779588Z 2025-05-07T20:25:57.1779592Z 2025-05-07T20:25:57.1779595Z 2025-05-07T20:25:57.1779599Z 2025-05-07T20:25:57.1779603Z 2025-05-07T20:25:57.1779606Z 2025-05-07T20:25:57.1779610Z 2025-05-07T20:25:57.1779613Z 2025-05-07T20:25:57.1779617Z 
2025-05-07T20:25:57.1779621Z 2025-05-07T20:25:57.1779624Z 2025-05-07T20:25:57.1779779Z  2025-05-07T20:25:57.1779970Z 2025-05-07T20:25:57.1779974Z 2025-05-07T20:25:57.1779977Z 2025-05-07T20:25:57.1779981Z 2025-05-07T20:25:57.1779984Z 2025-05-07T20:25:57.1779993Z 2025-05-07T20:25:57.1779997Z 2025-05-07T20:25:57.1780000Z 2025-05-07T20:25:57.1780004Z 2025-05-07T20:25:57.1780007Z 2025-05-07T20:25:57.1780011Z 2025-05-07T20:25:57.1780015Z 2025-05-07T20:25:57.1780018Z 2025-05-07T20:25:57.1780028Z 2025-05-07T20:25:57.1780035Z 2025-05-07T20:25:57.1780039Z 2025-05-07T20:25:57.1780188Z  2025-05-07T20:25:57.1780387Z 2025-05-07T20:25:57.1780391Z 2025-05-07T20:25:57.1780395Z 2025-05-07T20:25:57.1780398Z 2025-05-07T20:25:57.1780402Z 2025-05-07T20:25:57.1780411Z 2025-05-07T20:25:57.1780415Z 2025-05-07T20:25:57.1780419Z 2025-05-07T20:25:57.1780422Z 2025-05-07T20:25:57.1780426Z 2025-05-07T20:25:57.1780430Z 2025-05-07T20:25:57.1780433Z 2025-05-07T20:25:57.1780437Z 2025-05-07T20:25:57.1780440Z 2025-05-07T20:25:57.1780444Z 2025-05-07T20:25:57.1780448Z 2025-05-07T20:25:57.1780451Z 2025-05-07T20:25:57.1780605Z  2025-05-07T20:25:57.1780812Z 2025-05-07T20:25:57.1780820Z 2025-05-07T20:25:57.1780823Z 2025-05-07T20:25:57.1780827Z 2025-05-07T20:25:57.1780831Z 2025-05-07T20:25:57.1780834Z 2025-05-07T20:25:57.1780838Z 2025-05-07T20:25:57.1780842Z 2025-05-07T20:25:57.1780845Z 2025-05-07T20:25:57.1780849Z 2025-05-07T20:25:57.1780855Z 2025-05-07T20:25:57.1780859Z 2025-05-07T20:25:57.1780863Z 2025-05-07T20:25:57.1780866Z 2025-05-07T20:25:57.1780870Z 2025-05-07T20:25:57.1780874Z 2025-05-07T20:25:57.1780877Z 2025-05-07T20:25:57.1780881Z 2025-05-07T20:25:57.1781046Z  2025-05-07T20:25:57.1781250Z 2025-05-07T20:25:57.1781254Z 2025-05-07T20:25:57.1781360Z  2025-05-07T20:25:57.1781459Z 2025-05-07T20:25:57.1781463Z 2025-05-07T20:25:57.1781561Z  2025-05-07T20:25:57.1781666Z 2025-05-07T20:25:57.1781669Z 2025-05-07T20:25:57.1781673Z 2025-05-07T20:25:57.1781772Z  2025-05-07T20:25:57.1781876Z 2025-05-07T20:25:57.1781885Z 2025-05-07T20:25:57.1781889Z 2025-05-07T20:25:57.1781974Z 2025-05-07T20:25:57.1782077Z  2025-05-07T20:25:57.1782190Z 2025-05-07T20:25:57.1782194Z 2025-05-07T20:25:57.1782198Z 2025-05-07T20:25:57.1782201Z 2025-05-07T20:25:57.1782211Z 2025-05-07T20:25:57.1782319Z  2025-05-07T20:25:57.1782444Z 2025-05-07T20:25:57.1782523Z 2025-05-07T20:25:57.1782526Z 2025-05-07T20:25:57.1782530Z 2025-05-07T20:25:57.1782534Z 2025-05-07T20:25:57.1782537Z 2025-05-07T20:25:57.1782674Z  2025-05-07T20:25:57.1782797Z 2025-05-07T20:25:57.1782800Z 2025-05-07T20:25:57.1782804Z 2025-05-07T20:25:57.1782808Z 2025-05-07T20:25:57.1782812Z 2025-05-07T20:25:57.1782821Z 2025-05-07T20:25:57.1782825Z 2025-05-07T20:25:57.1782936Z  2025-05-07T20:25:57.1783070Z 2025-05-07T20:25:57.1783073Z 2025-05-07T20:25:57.1783077Z 2025-05-07T20:25:57.1783080Z 2025-05-07T20:25:57.1783084Z 2025-05-07T20:25:57.1783088Z 2025-05-07T20:25:57.1783098Z 2025-05-07T20:25:57.1783101Z 2025-05-07T20:25:57.1783224Z  2025-05-07T20:25:57.1783369Z 2025-05-07T20:25:57.1783372Z 2025-05-07T20:25:57.1783376Z 2025-05-07T20:25:57.1783380Z 2025-05-07T20:25:57.1783383Z 2025-05-07T20:25:57.1783392Z 2025-05-07T20:25:57.1783396Z 2025-05-07T20:25:57.1783399Z 2025-05-07T20:25:57.1783403Z 2025-05-07T20:25:57.1783533Z  2025-05-07T20:25:57.1783687Z 2025-05-07T20:25:57.1783690Z 2025-05-07T20:25:57.1783694Z 2025-05-07T20:25:57.1783697Z 2025-05-07T20:25:57.1783707Z 2025-05-07T20:25:57.1783710Z 2025-05-07T20:25:57.1783714Z 2025-05-07T20:25:57.1783717Z 2025-05-07T20:25:57.1783721Z 2025-05-07T20:25:57.1783725Z 
2025-05-07T20:25:57.1783848Z  2025-05-07T20:25:57.1784008Z 2025-05-07T20:25:57.1784018Z 2025-05-07T20:25:57.1784021Z 2025-05-07T20:25:57.1784025Z 2025-05-07T20:25:57.1784028Z 2025-05-07T20:25:57.1784032Z 2025-05-07T20:25:57.1784036Z 2025-05-07T20:25:57.1784039Z 2025-05-07T20:25:57.1784043Z 2025-05-07T20:25:57.1784047Z 2025-05-07T20:25:57.1784050Z 2025-05-07T20:25:57.1784183Z  2025-05-07T20:25:57.1784362Z 2025-05-07T20:25:57.1784365Z 2025-05-07T20:25:57.1784369Z 2025-05-07T20:25:57.1784372Z 2025-05-07T20:25:57.1784376Z 2025-05-07T20:25:57.1784380Z 2025-05-07T20:25:57.1784383Z 2025-05-07T20:25:57.1784390Z 2025-05-07T20:25:57.1784394Z 2025-05-07T20:25:57.1784397Z 2025-05-07T20:25:57.1784401Z 2025-05-07T20:25:57.1784405Z 2025-05-07T20:25:57.1784544Z  done 2025-05-07T20:25:57.4984980Z Preparing transaction: \ | / done 2025-05-07T20:25:58.9344690Z Verifying transaction: \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:59.6813667Z Executing transaction: - \ | / - \ | done 2025-05-07T20:26:02.0232851Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:02.0233410Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:02.0234226Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:02.0234787Z 2025-05-07T20:26:02.0246693Z 2025-05-07T20:26:02.0247713Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:02.0248456Z 2025-05-07T20:26:02.0258944Z 2025-05-07T20:26:02.0259260Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:02.0265469Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:02.0269469Z 2025-05-07T20:26:02.0534480Z 2025-05-07T20:26:02.0540196Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp 
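[NOTE] The fixup above is two idempotent `ln -sf` calls plus a header copy. A minimal standalone sketch of the same pattern follows (editor's illustration, not part of this run; the env prefix is assumed from the paths in the log, and the helper name is hypothetical):
# fix_nvtools_links.sh: hypothetical helper mirroring the [INSTALL] step above
PREFIX="${CONDA_PREFIX:-/home/ec2-user/miniconda/envs/build_binary}"
for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
  # ln -sf replaces any existing link in place, so re-running the step is safe
  if [ -e "$libdir/libnvToolsExt.so.1" ]; then
    ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
  fi
done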
2025-05-07T20:26:02.0561493Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:02.0926909Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:03.9759643Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:04.0399627Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:04.4639807Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:04.4993445Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:04.9298308Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:04.9299459Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:07.3750678Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:09.4032334Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:11.4366828Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:11.4367645Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:13.4609461Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:15.3513102Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:15.4143411Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:19.2862200Z /tmp/tmpdfxt0iiv: line 3: clang: command not found
2025-05-07T20:26:19.2863196Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:19.3506307Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:19.3528670Z total 36
2025-05-07T20:26:19.3529072Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:25 .
2025-05-07T20:26:19.3529468Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:24 ..
2025-05-07T20:26:19.3530042Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:19.3530984Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:19.3531625Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:19.3532242Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:19.3532713Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:19.3533157Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
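[NOTE] `conda env config vars set` persists a variable inside the env itself, so it is re-exported on every activation and inherited by `conda run`; the first ERROR above is what a pre-check produces when the variable is not yet set, since `printenv` exits non-zero for an unset name. A minimal sketch of the set/verify/unset pattern (editor's illustration with a placeholder variable name, not part of this run):
# persist, inspect, verify, and remove an env-scoped variable
conda env config vars set -n build_binary MY_FLAG=1     # store it in the env
conda env config vars list -n build_binary              # show stored vars
conda run -n build_binary printenv MY_FLAG              # verify inside the env
conda env config vars unset -n build_binary MY_FLAG     # remove it again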
2025-05-07T20:26:19.3533653Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:19.3534280Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:19.3552215Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:21.2957310Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:21.2958062Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:21.7187079Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:23.6018989Z -allow-unsupported-compiler
2025-05-07T20:26:23.6640418Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:23.6641105Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
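[NOTE] The command above preprocesses an empty CUDA source (`-E -x cu -` on /dev/null) and forwards `-dM` to the host compiler, so the (truncated) dump that follows lists every predefined macro from glibc, libstdc++, and the CUDA headers alike. In practice the output is usually filtered; a sketch (editor's illustration, not part of this run):
# confirm the toolkit version nvcc reports, using macros visible in the dump below
conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null \
  | grep -E '__CUDA_API_VER|__CUDACC_VER'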
2025-05-07T20:26:25.6049247Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:25.6050039Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:25.6148042Z #define __CUDA_API_VER_MAJOR__ 12
2025-05-07T20:26:25.6217339Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:26:25.6227204Z [... several thousand further #define lines from glibc, libstdc++, and the CUDA runtime headers elided ...]
2025-05-07T20:26:25.6227695Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ?
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:25.6228176Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.6228438Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:25.6228759Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:25.6229109Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:25.6229396Z #define __USE_ISOC11 1 2025-05-07T20:26:25.6229622Z #define _BSD_SIZE_T_ 2025-05-07T20:26:25.6229918Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:25.6230262Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:25.6230526Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:25.6235809Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:25.6236150Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:25.6236455Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:25.6236776Z #define __THROW throw () 2025-05-07T20:26:25.6237041Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:25.6237348Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6237694Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:25.6238041Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:25.6238302Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:25.6238557Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:25.6238816Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:25.6239067Z #define L_tmpnam 20 2025-05-07T20:26:25.6239284Z #define ___int_wchar_t_h 2025-05-07T20:26:25.6239619Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:25.6239984Z #define isascii(c) __isascii (c) 2025-05-07T20:26:25.6240236Z #define _T_PTRDIFF 2025-05-07T20:26:25.6240531Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:25.6240872Z #define toascii(c) __toascii (c) 2025-05-07T20:26:25.6241121Z #define __GNUC__ 11 2025-05-07T20:26:25.6241370Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:25.6241649Z #define __GXX_RTTI 1 2025-05-07T20:26:25.6241865Z #define __pie__ 2 2025-05-07T20:26:25.6242066Z #define __MMX__ 1 2025-05-07T20:26:25.6242277Z #define __cudaCDP2Malloc 2025-05-07T20:26:25.6242518Z #define __timespec_defined 1 2025-05-07T20:26:25.6242768Z #define L_ctermid 9 2025-05-07T20:26:25.6242987Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:25.6243277Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:25.6243657Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:25.6244017Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:25.6244267Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:25.6244551Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:25.6244849Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:25.6245148Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:25.6245405Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:25.6245831Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:25.6246558Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:25.6247141Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:25.6247432Z #define __USE_SVID 1 2025-05-07T20:26:25.6247677Z #define __constant__ __location__(constant) 2025-05-07T20:26:25.6247974Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:25.6248375Z #define __device__ __location__(device) 2025-05-07T20:26:25.6248697Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:25.6249007Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:25.6249261Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:25.6249669Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:25.6250011Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:25.6250364Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:25.6250638Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:25.6250993Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:25.6251357Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:25.6251749Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:25.6252104Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:25.6252513Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:25.6252815Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:25.6253082Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:25.6253333Z #define NGROUPS_MAX 65536 2025-05-07T20:26:25.6253578Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:25.6253827Z #define __USE_ISOC95 1 2025-05-07T20:26:25.6254043Z #define _TIME_H 1 2025-05-07T20:26:25.6254302Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:25.6254613Z #define __USE_ISOC99 1 2025-05-07T20:26:25.6254924Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:25.6255275Z #define HOST_NAME_MAX 64 2025-05-07T20:26:25.6255515Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:25.6255766Z #define _IOS_ATEND 4 2025-05-07T20:26:25.6255984Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:25.6256300Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.6256687Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.6257011Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:25.6257282Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:25.6257590Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:25.6257891Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:25.6258135Z #define _STDIO_H 1 2025-05-07T20:26:25.6258519Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:25.6258976Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:25.6259326Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:25.6259689Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:25.6259968Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:25.6260221Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:25.6260481Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:25.6260758Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:25.6261047Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.6261350Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:25.6261611Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.6261883Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:25.6262169Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:25.6262432Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:25.6262708Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:25.6263049Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:25.6263407Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:25.6263640Z #define __USE_XOPEN 1 2025-05-07T20:26:25.6263867Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:25.6264295Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:25.6264720Z #define __USE_XOPEN2K 1 2025-05-07T20:26:25.6264951Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:25.6265210Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:25.6265491Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:25.6265750Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:25.6266258Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.6266857Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.6267129Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:25.6267474Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:25.6267848Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:25.6268291Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:25.6268674Z #define __END_NAMESPACE_C99 2025-05-07T20:26:25.6268932Z #define __glibcxx_integral_traps true 2025-05-07T20:26:25.6269204Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:25.6269445Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:25.6269692Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:25.6270043Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:25.6270284Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:25.6270558Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:25.6270845Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:25.6271198Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:25.6271568Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:25.6271838Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:25.6272091Z #define _IO_UNITBUF 020000 2025-05-07T20:26:25.6272330Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:25.6272579Z #define __FD_SETSIZE 1024 2025-05-07T20:26:25.6272831Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:25.6273090Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:25.6273419Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:25.6273763Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:25.6274014Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:25.6274312Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:25.6274617Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:25.6274874Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:25.6275163Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:25.6275484Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:25.6275771Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:25.6276082Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:25.6276358Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:25.6276619Z #define __USE_POSIX199506 1 2025-05-07T20:26:25.6276858Z #define _FEATURES_H 1 2025-05-07T20:26:25.6277087Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:25.6277474Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:25.6277879Z #define __stub_getmsg 2025-05-07T20:26:25.6278103Z #define _IO_FIXED 010000 2025-05-07T20:26:25.6278362Z #define __cpp_lib_addressof_constexpr 201603 
2025-05-07T20:26:25.6278658Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:25.6278914Z #define __stub_setlogin 2025-05-07T20:26:25.6279141Z #define __stub_fattach 2025-05-07T20:26:25.6279368Z #define __cplusplus 201703L 2025-05-07T20:26:25.6279616Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:25.6279886Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:25.6280132Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:25.6280397Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:25.6280869Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:25.6281380Z #define _IO_INTERNAL 010 2025-05-07T20:26:25.6281610Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:25.6281938Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:25.6282284Z #define __dev_t_defined 2025-05-07T20:26:25.6282508Z #define __DEPRECATED 1 2025-05-07T20:26:25.6282726Z #define __S32_TYPE int 2025-05-07T20:26:25.6282967Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:25.6283248Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:25.6283490Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:25.6283735Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:25.6284326Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:25.6284941Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:25.6285400Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:25.6285733Z #define OVERFLOW 3 2025-05-07T20:26:25.6285965Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:25.6286265Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:25.6286551Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6286996Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:25.6287317Z #define __SSE2_MATH__ 1 2025-05-07T20:26:25.6287546Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:25.6287838Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6288119Z #define _IO_STDIO_H 2025-05-07T20:26:25.6288349Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:25.6288629Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:25.6288933Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:25.6289221Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:25.6289521Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:25.6289771Z #define __amd64 1 2025-05-07T20:26:25.6289987Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:25.6290243Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:25.6290504Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:25.6290780Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:25.6291075Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:25.6291332Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:25.6291616Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:25.6291869Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:25.6292109Z #define __bounded 2025-05-07T20:26:25.6292333Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6292621Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:25.6292892Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:25.6293154Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:25.6293433Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.6293738Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:25.6294155Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:25.6294564Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:25.6294826Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:25.6295163Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:25.6295506Z #define STA_PLL 0x0001 2025-05-07T20:26:25.6295739Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:25.6296014Z #define __GNUG__ 11 2025-05-07T20:26:25.6296239Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:25.6296496Z #define _T_WCHAR 2025-05-07T20:26:25.6296724Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:25.6297049Z #define __specialization_static 2025-05-07T20:26:25.6297357Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:25.6297655Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:25.6297911Z #define cudaArraySparse 0x40 2025-05-07T20:26:25.6298173Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:25.6298410Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:25.6298687Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:25.6298988Z #define _WCHAR_T 2025-05-07T20:26:25.6299200Z #define __cudaCDP2Free 2025-05-07T20:26:25.6299829Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:25.6300505Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:25.6300918Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:25.6301352Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:25.6301631Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:25.6301889Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:25.6302218Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.6302569Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:25.6302818Z #define __NO_CTYPE 1 2025-05-07T20:26:25.6303039Z #define __stub_bdflush 2025-05-07T20:26:25.6303394Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:25.6304178Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:25.6304489Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:25.6304752Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:25.6305025Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:25.6305447Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:25.6305736Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:25.6306069Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:25.6306410Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:25.6306686Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:25.6306963Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:25.6307314Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:25.6307648Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:25.6307924Z #define _IO_STDIO 040000 2025-05-07T20:26:25.6308248Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:25.6308639Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:25.6308949Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:25.6309242Z #define _PTRDIFF_T 2025-05-07T20:26:25.6309461Z #define _MOVE_H 1 2025-05-07T20:26:25.6309686Z #define __cpp_hex_float 201603L 2025-05-07T20:26:25.6310016Z #define ADJ_TAI 0x0080 2025-05-07T20:26:25.6310252Z #define __ptrvalue 2025-05-07T20:26:25.6310479Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:25.6310731Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:25.6311018Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:25.6311310Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:25.6311560Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:25.6311846Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:25.6312236Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:25.6312613Z #define __USE_GNU 1 2025-05-07T20:26:25.6312843Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:25.6313117Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:25.6313384Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:25.6313770Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:25.6314151Z #define WEXITED 4 2025-05-07T20:26:25.6314364Z #define _IO_NO_READS 4 2025-05-07T20:26:25.6314668Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:25.6315017Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:25.6315285Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:25.6315581Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:25.6315889Z #define __uid_t_defined 2025-05-07T20:26:25.6316132Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:25.6316425Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:25.6316696Z #define WNOHANG 1 2025-05-07T20:26:25.6316940Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:25.6317241Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:25.6317510Z #define cudaEventDefault 0x00 2025-05-07T20:26:25.6317809Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:25.6318116Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:25.6318349Z #define __x86_64 1 2025-05-07T20:26:25.6318576Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:25.6318966Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:25.6319443Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:25.6319931Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.6320350Z #define __PTRDIFF_T 2025-05-07T20:26:25.6320668Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:25.6321044Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:25.6321319Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6321599Z #define _Mlong_double_ long double 2025-05-07T20:26:25.6321871Z #define __cpp_lambdas 200907L 2025-05-07T20:26:25.6322122Z #define _IO_DEC 020 2025-05-07T20:26:25.6322340Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:25.6322702Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:25.6322995Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:25.6323271Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:25.6323534Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:25.6323835Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:25.6324233Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:25.6324511Z #define _ANSI_STDDEF_H 2025-05-07T20:26:25.6324776Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:25.6325081Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:25.6325450Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:25.6325828Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:25.6326111Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:25.6326393Z #define __cpp_template_auto 201606L 2025-05-07T20:26:25.6326747Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:25.6327115Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:25.6327379Z #define 
__key_t_defined 2025-05-07T20:26:25.6327627Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:25.6327993Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:25.6328450Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:25.6328815Z #define __GNUC_VA_LIST 2025-05-07T20:26:25.6329141Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:25.6329517Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:25.6329770Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:25.6330046Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:25.6330339Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:25.6330578Z #define __WCOREFLAG 0x80 2025-05-07T20:26:25.6330907Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:25.6331236Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:25.6331505Z #define __LP64__ 1 2025-05-07T20:26:25.6331751Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:25.6332067Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:25.6332339Z #define _IO_off64_t __off64_t 2025-05-07T20:26:25.6332594Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:25.6332847Z #define __time_t_defined 1 2025-05-07T20:26:25.6333091Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:25.6333437Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:25.6333797Z #define __USE_UNIX98 1 2025-05-07T20:26:25.6334032Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6334296Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:25.6334561Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:25.6334855Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:25.6335154Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:25.6335412Z #define SEEK_CUR 1 2025-05-07T20:26:25.6335636Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:25.6335893Z #define _ASSERT_H 1 2025-05-07T20:26:25.6336467Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:25.6337089Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:25.6337353Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:25.6337606Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:25.6337873Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:25.6338144Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:25.6338510Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:25.6338910Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:25.6339562Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:25.6340199Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:25.6340490Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:25.6340835Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:25.6341303Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:25.6341568Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:25.6341848Z #define cudaArrayDefault 0x00 2025-05-07T20:26:25.6342125Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:25.6342408Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:25.6342784Z #define TLOSS 5 2025-05-07T20:26:25.6343000Z #define __ssize_t_defined 2025-05-07T20:26:25.6343246Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:25.6343513Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:25.6343794Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:25.6344077Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:25.6344431Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:25.6344810Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:25.6345084Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:25.6345366Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:25.6345670Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:25.6345963Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:25.6346241Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:25.6346497Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:25.6346823Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:25.6347175Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:25.6347413Z #define __cdecl 2025-05-07T20:26:25.6347644Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:25.6347962Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:25.6348284Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:25.6348534Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:25.6348797Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:25.6349080Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:25.6349345Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:25.6349644Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:25.6350068Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:25.6350474Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:25.6350904Z #define ADJ_NANO 0x2000 2025-05-07T20:26:25.6351203Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:25.6351550Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:25.6351836Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:25.6352089Z #define __FLT_DIG__ 6 2025-05-07T20:26:25.6352432Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:25.6352820Z #define __NO_INLINE__ 1 2025-05-07T20:26:25.6357592Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:25.6357961Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:25.6358214Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:25.6358471Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:25.6358751Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:25.6359016Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:25.6359307Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:25.6359589Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:25.6359958Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
2025-05-07T20:26:25.6360363Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:25.6360699Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:25.6361040Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:25.6361274Z #define MAX_CANON 255 2025-05-07T20:26:25.6361497Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:25.6361736Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:25.6361995Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:25.6362269Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:25.6362561Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:25.6362848Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:25.6363116Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:25.6363427Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:25.6363725Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:25.6364083Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:25.6364371Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:25.6364650Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:25.6364915Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:25.6365213Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:25.6365570Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:25.6365821Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:25.6366064Z #define _SYS_TYPES_H 1 2025-05-07T20:26:25.6366297Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:25.6366545Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:25.6366784Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:25.6367008Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:25.6367271Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:25.6367550Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:25.6367792Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:25.6368075Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:25.6368339Z #define FP_SUBNORMAL 3 2025-05-07T20:26:25.6368581Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:25.6368852Z #define _INITIALIZER_LIST 2025-05-07T20:26:25.6369093Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:25.6369327Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:25.6369595Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:25.6369879Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:25.6370125Z #define _IO_file_flags _flags 2025-05-07T20:26:25.6370374Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:25.6370612Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:25.6370878Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:25.6371142Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:25.6371401Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:25.6371765Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:25.6372146Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:25.6372442Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:25.6372700Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:25.6372947Z #define _BSD_SOURCE 1 2025-05-07T20:26:25.6373173Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:25.6374010Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:25.6374833Z #define __catch(X) catch(X) 2025-05-07T20:26:25.6375089Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:25.6375366Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:25.6375633Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:25.6375874Z #define __STRING(x) #x 2025-05-07T20:26:25.6376105Z #define 
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:25.6376369Z #define _T_PTRDIFF_ 2025-05-07T20:26:25.6376598Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:25.6376889Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:25.6377154Z #define __unbounded 2025-05-07T20:26:25.6377382Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:25.6377668Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:25.6377937Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6378220Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:25.6378487Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:25.6378774Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:25.6379091Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:25.6379386Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:25.6379653Z #define __managed__ __location__(managed) 2025-05-07T20:26:25.6379943Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:25.6380328Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:25.6380738Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:25.6380990Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:25.6381347Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:25.6381740Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:25.6382071Z #define _SYS_SIZE_T_H 2025-05-07T20:26:25.6382352Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:25.6382677Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:25.6382948Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:25.6383228Z #define _CRTIMP 2025-05-07T20:26:25.6383522Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:25.6383815Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:25.6384132Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:25.6384474Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:25.6384869Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:25.6385174Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:25.6385439Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:25.6385717Z #define __SIZE_T__ 2025-05-07T20:26:25.6385921Z #define __stub_gtty 2025-05-07T20:26:25.6386138Z #define __pid_t_defined 2025-05-07T20:26:25.6386383Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:25.6386685Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:25.6386986Z #define __glibcxx_function_requires(...) 
2025-05-07T20:26:25.6387263Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:25.6387497Z #define __need_clockid_t 2025-05-07T20:26:25.6387726Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:25.6387983Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:25.6388294Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:25.6388599Z #define _IO_HEX 0100 2025-05-07T20:26:25.6388844Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:25.6389168Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:25.6389468Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:25.6389730Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:25.6390238Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:25.6390665Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:25.6390967Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:25.6391251Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:25.6391354Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:25.6391455Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:25.6391535Z #define __stub_sstk 2025-05-07T20:26:25.6391625Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:25.6391783Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:25.6391866Z #define __wur 2025-05-07T20:26:25.6391983Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:25.6392069Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:25.6392146Z #define _IO_OCT 040 2025-05-07T20:26:25.6392240Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:25.6392325Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:25.6392411Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:25.6392541Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:25.6392627Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:25.6392725Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:25.6392917Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:25.6393008Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:25.6393095Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:25.6393199Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:25.6393285Z #define __off64_t_defined 2025-05-07T20:26:25.6393389Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:25.6393472Z #define __FLT128_DIG__ 33 2025-05-07T20:26:25.6393571Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:25.6393665Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:25.6393746Z #define __INT32_C(c) c 2025-05-07T20:26:25.6393836Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:25.6393941Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:25.6394033Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:25.6394120Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:25.6394206Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:25.6394300Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:25.6394431Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:25.6394615Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:25.6394701Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:25.6394797Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:25.6394888Z #define __have_pthread_attr_t 1 2025-05-07T20:26:25.6394986Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:25.6395286Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:25.6395390Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:25.6395488Z #define __cudaCDP2EventRecord 2025-05-07T20:26:25.6395582Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:25.6395663Z #define 
htole32(x) (x) 2025-05-07T20:26:25.6395913Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:25.6396032Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:25.6396127Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:25.6396283Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:25.6396421Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:25.6396540Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:25.6396677Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:25.6396762Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:25.6396857Z #define cudaArrayLayered 0x01 2025-05-07T20:26:25.6397032Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:25.6397134Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:25.6397227Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:25.6397325Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:25.6397401Z #define unix 1 2025-05-07T20:26:25.6397492Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:25.6397585Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:25.6397675Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:25.6397791Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:25.6397872Z #define __USE_POSIX 1 2025-05-07T20:26:25.6397962Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:25.6398097Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:25.6398182Z #define __THROWNL throw () 2025-05-07T20:26:25.6398269Z #define __cpp_rtti 199711L 2025-05-07T20:26:25.6398371Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:25.6398452Z #define __PMT(args) args 2025-05-07T20:26:25.6398567Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:25.6398709Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:25.6398816Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:25.6398907Z #define _SIZE_T_DECLARED 2025-05-07T20:26:25.6398998Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:25.6399087Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:25.6399477Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:25.6399572Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:25.6399661Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:25.6399753Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:25.6399903Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:25.6399988Z #define _WCHAR_T_H 2025-05-07T20:26:25.6400071Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:25.6400157Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:25.6400241Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:25.6400338Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:25.6400426Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:25.6400515Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:25.6400616Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:25.6400694Z #define __ELF__ 1 2025-05-07T20:26:25.6400793Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:25.6400888Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:25.6400969Z #define STA_INS 0x0010 2025-05-07T20:26:25.6401065Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:25.6401233Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:25.6401326Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:25.6401502Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:25.6401609Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:25.6401715Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6401810Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:25.6401909Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:25.6403633Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:25.6404064Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:25.6404225Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:25.6404326Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:25.6404644Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:25.6404772Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:25.6404863Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:25.6404946Z #define __FLT_RADIX__ 2 2025-05-07T20:26:25.6405049Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:25.6405217Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:25.6405307Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:25.6405403Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:25.6405498Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:25.6405590Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:25.6405698Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:25.6405795Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:25.6405877Z #define WORD_BIT 32 2025-05-07T20:26:25.6405963Z #define _IO_USER_BUF 1 2025-05-07T20:26:25.6406053Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:25.6406157Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:25.6406260Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:25.6406357Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:25.6406457Z #define __long_double_t long double 2025-05-07T20:26:25.6406547Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:25.6406633Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:25.6407038Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:25.6407121Z #define __k8 1 2025-05-07T20:26:25.6407311Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:25.6407482Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:25.6407596Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:25.6407694Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:25.6407788Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:25.6407886Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:25.6407980Z #define __blksize_t_defined 2025-05-07T20:26:25.6408070Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:25.6408164Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:25.6408277Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:25.6408368Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:25.6408467Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:25.6408559Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:25.6408653Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:25.6408907Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:25.6409242Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:25.6409345Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:25.6409444Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:25.6409525Z #define SEEK_SET 0 2025-05-07T20:26:25.6409618Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:25.6409709Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:25.6409905Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:25.6410005Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:25.6410107Z #define __cudaCDP2GetLastError 2025-05-07T20:26:25.6410201Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:25.6410291Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:25.6410774Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:25.6410875Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:25.6410969Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:25.6411067Z #define __stub_sigreturn 2025-05-07T20:26:25.6411298Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:25.6411502Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:25.6411595Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:25.6411692Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:25.6411773Z #define CLOCK_TAI 11 2025-05-07T20:26:25.6411879Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:25.6411965Z #define __restrict_arr 2025-05-07T20:26:25.6412082Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:25.6412220Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:25.6412737Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:25.6412923Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:25.6413009Z #define __USE_MISC 1 2025-05-07T20:26:25.6413109Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:25.6413215Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:25.6413299Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:25.6413391Z #define __LDBL_DIG__ 18 2025-05-07T20:26:25.6413486Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:25.6413588Z #define __malloc_and_calloc_defined 2025-05-07T20:26:25.6413687Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:25.6413788Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:25.6413868Z #define __x86_64__ 1 2025-05-07T20:26:25.6413952Z #define _SIZE_T_ 2025-05-07T20:26:25.6414834Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:25.6414943Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:25.6415036Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:25.6415154Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:25.6415273Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:25.6415365Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:25.6415471Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:25.6415595Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:25.6415731Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:25.6415824Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:25.6416283Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
[... remainder of the compiler's preprocessor #define dump elided -- several hundred glibc, libstdc++, and CUDA runtime macros; notable values include __CUDACC__ 1, __NVCC__ 1, __CUDA_ARCH_LIST__ 520, CUDART_VERSION 12060, _GLIBCXX_RELEASE 11, and __GLIBC_MINOR__ 17 ...]

2025-05-07T20:26:25.6664925Z + conda run -n build_binary nvcc --version
2025-05-07T20:26:27.5645739Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:26:27.5646239Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:26:27.5646651Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:26:27.5646958Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:26:27.5647315Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:26:27.6282201Z /usr/bin/nvidia-smi
2025-05-07T20:26:27.6287660Z + nvidia-smi
2025-05-07T20:26:27.6460336Z Wed May 7 20:26:27 2025
2025-05-07T20:26:27.6460821Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.6461495Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:26:27.6462001Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.6462493Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:26:27.6463006Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:26:27.6463436Z |                                         |                        |               MIG M. |
2025-05-07T20:26:27.6463764Z |=========================================+========================+======================|
2025-05-07T20:26:27.6631673Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:26:27.6632249Z |  0%   25C    P8             16W / 300W  |      0MiB / 23028MiB   |      0%      Default |
2025-05-07T20:26:27.6632756Z |                                         |                        |                  N/A |
2025-05-07T20:26:27.6633202Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:26:27.6637047Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.6637689Z | Processes:                                                                              |
2025-05-07T20:26:27.6638233Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:26:27.6638628Z |        ID   ID                                                                    Usage |
2025-05-07T20:26:27.6638970Z |=========================================================================================|
2025-05-07T20:26:27.6641526Z |  No running processes found                                                             |
2025-05-07T20:26:27.6642178Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:26:27.9327523Z [INSTALL] Successfully installed CUDA 12.6.3
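[NOTE] The two probes above intentionally report different things: nvcc shows the toolkit just installed into the conda env (release 12.6), while the nvidia-smi banner shows the highest CUDA version the 570.133.07 driver can serve (12.8). The setup is consistent as long as the toolkit does not exceed the driver's CUDA version. A minimal bash sketch of that check (illustrative only; not part of setup_env.bash):

    # Illustrative consistency check -- parse both probes, then require
    # toolkit <= driver CUDA using a version-aware sort.
    toolkit=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')        # 12.6
    driver_cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p')  # 12.8
    if [ "$(printf '%s\n' "$toolkit" "$driver_cuda" | sort -V | head -n1)" = "$toolkit" ]; then
      echo "[CHECK] Driver CUDA ${driver_cuda} covers toolkit ${toolkit}"
    else
      echo "[CHECK] Toolkit ${toolkit} exceeds driver CUDA ${driver_cuda}" >&2
      exit 1
    fi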
2025-05-07T20:26:27.9381155Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:27.9381711Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:26:27.9394204Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:26:27.9394549Z env:
2025-05-07T20:26:27.9394768Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:26:27.9395059Z   BUILD_ENV: build_binary
2025-05-07T20:26:27.9395304Z   BUILD_TARGET: genai
2025-05-07T20:26:27.9395531Z   BUILD_VARIANT: cuda
2025-05-07T20:26:27.9395762Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:26:27.9396009Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:26:27.9396337Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:26:27.9396793Z ##[endgroup]
2025-05-07T20:26:28.2751995Z ################################################################################
2025-05-07T20:26:28.2752372Z # Install PyTorch (PIP)
2025-05-07T20:26:28.2752602Z #
2025-05-07T20:26:28.2768413Z # [2025-05-07T20:26:28.276Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:26:28.2768856Z ################################################################################
2025-05-07T20:26:28.2798261Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:26:29.2666751Z Channels:
2025-05-07T20:26:29.2666992Z  - conda-forge
2025-05-07T20:26:29.2667213Z Platform: linux-64
2025-05-07T20:26:32.5763424Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:33.3070727Z Solving environment: done
2025-05-07T20:26:33.5296875Z ## Package Plan ##
2025-05-07T20:26:33.5297357Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:33.5297882Z   added / updated specs:
2025-05-07T20:26:33.5298202Z     - numpy
2025-05-07T20:26:33.5298557Z The following packages will be downloaded:
2025-05-07T20:26:33.5299012Z     package                    |            build
2025-05-07T20:26:33.5299428Z     ---------------------------|-----------------
2025-05-07T20:26:33.5299816Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5300434Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5300987Z     libgfortran-15.1.0         |      h69a702a_2              34 KB  conda-forge
2025-05-07T20:26:33.5301427Z     libgfortran5-15.1.0        |      hcea5267_2             1.5 MB  conda-forge
2025-05-07T20:26:33.5301887Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:26:33.5302361Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:26:33.5302804Z     numpy-2.0.2                |   py39h9cb892a_1            7.6 MB  conda-forge
2025-05-07T20:26:33.5303192Z     ------------------------------------------------------------
2025-05-07T20:26:33.5303522Z                                            Total:        14.8 MB
2025-05-07T20:26:33.5304160Z The following NEW packages will be INSTALLED:
2025-05-07T20:26:33.5304593Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:26:33.5305161Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:26:33.5314176Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:26:33.5314827Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:26:33.5315362Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:26:33.5315987Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:26:33.5316979Z   numpy              conda-forge/linux-64::numpy-2.0.2-py39h9cb892a_1
2025-05-07T20:26:33.5317473Z Downloading and Extracting Packages: ...working... done
[... interleaved per-package terminal progress bars elided; all seven packages reached 100% ...]
2025-05-07T20:26:34.4449035Z Preparing transaction: done
2025-05-07T20:26:34.6458720Z Verifying transaction: done
2025-05-07T20:26:34.7467800Z Executing transaction: done
2025-05-07T20:26:34.9227885Z ################################################################################
2025-05-07T20:26:34.9228381Z # Install Package From PyTorch PIP: torch
2025-05-07T20:26:34.9228799Z #
2025-05-07T20:26:34.9248193Z # [2025-05-07T20:26:34.924Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:26:34.9248873Z ################################################################################
2025-05-07T20:26:34.9265251Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:26:35.0172924Z [CHECK] Network does not appear to be blocked.
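[NOTE] The "[EXEC] [ATTEMPT 0/3]" prefix above comes from a retry wrapper in setup_env.bash; the wrapper's definition is not shown in this log. A plausible equivalent, with the function name, attempt counting, and sleep interval all assumed:

    # Assumed shape of the retry helper (illustrative, not the actual code):
    # run a command up to max+1 times, pausing briefly between failures.
    exec_with_retries () {
      local max=$1; shift
      local attempt
      for attempt in $(seq 0 "$max"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep 2
      done
      echo "[EXEC] Command failed after $((max + 1)) attempts: $*" >&2
      return 1
    }

    # Usage mirroring the network probe above:
    exec_with_retries 3 wget -q --timeout 1 pypi.org -O /dev/null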
2025-05-07T20:26:35.0173878Z ################################################################################
2025-05-07T20:26:35.0174736Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:26:35.0175382Z #
2025-05-07T20:26:35.0192963Z # [2025-05-07T20:26:35.018Z] + __prepare_pip_arguments torch nightly cuda/12.6.3
2025-05-07T20:26:35.0193813Z ################################################################################
2025-05-07T20:26:35.0215381Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:26:35.0240710Z [INSTALL] Extracted package variant: cu126
2025-05-07T20:26:35.0257297Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:26:35.0258016Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:26:35.0266241Z [INSTALL] Extracted the full PIP package: --pre torch
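[NOTE] The extraction above is mechanical: the CUDA spec "cuda/12.6.3" yields the variant tag "cu126" (major.minor with the dot removed), and channel plus variant select the wheel index. A hedged sketch of the derivation (variable names assumed; the real logic in __prepare_pip_arguments may differ):

    # Assumed derivation of the PIP index URL from the workflow inputs.
    spec="cuda/12.6.3"
    channel="nightly"
    version=${spec#cuda/}                                        # 12.6.3
    variant="cu$(echo "${version}" | cut -d. -f1-2 | tr -d .)"   # cu126
    index_url="https://download.pytorch.org/whl/${channel}/${variant}/"
    echo "${index_url}"   # https://download.pytorch.org/whl/nightly/cu126/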
2025-05-07T20:26:35.0275288Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ...
2025-05-07T20:26:35.0296621Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6040625Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/
2025-05-07T20:27:55.6041191Z Collecting torch
2025-05-07T20:27:55.6041869Z   Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:27:55.6042770Z Collecting filelock (from torch)
2025-05-07T20:27:55.6043432Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:27:55.6044666Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from torch) (4.13.2)
2025-05-07T20:27:55.6045381Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:27:55.6045886Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:27:55.6046732Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 35.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6047080Z Collecting networkx (from torch)
2025-05-07T20:27:55.6047582Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB)
2025-05-07T20:27:55.6048298Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 18.8 MB/s eta 0:00:00
2025-05-07T20:27:55.6048643Z Collecting jinja2 (from torch)
2025-05-07T20:27:55.6049123Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:27:55.6049640Z Collecting fsspec (from torch)
2025-05-07T20:27:55.6050148Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:27:55.6050721Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6051440Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
2025-05-07T20:27:55.6052266Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 64.2 MB/s eta 0:00:00
2025-05-07T20:27:55.6052686Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6053415Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB)
2025-05-07T20:27:55.6054208Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.4 MB/s eta 0:00:00
2025-05-07T20:27:55.6054614Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch)
2025-05-07T20:27:55.6055332Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB)
2025-05-07T20:27:55.6056106Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 43.6 MB/s eta 0:00:00
2025-05-07T20:27:55.6056491Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch)
2025-05-07T20:27:55.6057176Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
2025-05-07T20:27:55.6057944Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.2 MB/s eta 0:00:00
2025-05-07T20:27:55.6058332Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch)
2025-05-07T20:27:55.6059852Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
2025-05-07T20:27:55.6060724Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 58.9 MB/s eta 0:00:00
2025-05-07T20:27:55.6061309Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch)
2025-05-07T20:27:55.6061988Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB)
2025-05-07T20:27:55.6062755Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 128.9 MB/s eta 0:00:00
2025-05-07T20:27:55.6063151Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch)
2025-05-07T20:27:55.6063843Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB)
2025-05-07T20:27:55.6064615Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 192.6 MB/s eta 0:00:00
2025-05-07T20:27:55.6065001Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch)
2025-05-07T20:27:55.6065711Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB)
2025-05-07T20:27:55.6066489Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 152.8 MB/s eta 0:00:00
2025-05-07T20:27:55.6066893Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch)
2025-05-07T20:27:55.6067587Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB)
2025-05-07T20:27:55.6068416Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 128.4 MB/s eta 0:00:00
2025-05-07T20:27:55.6068809Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:27:55.6069502Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:27:55.6070358Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 163.0 MB/s eta 0:00:00
2025-05-07T20:27:55.6070732Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:27:55.6071505Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:27:55.6072284Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch)
2025-05-07T20:27:55.6072938Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB)
2025-05-07T20:27:55.6073612Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch)
2025-05-07T20:27:55.6074391Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
2025-05-07T20:27:55.6075245Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 149.3 MB/s eta 0:00:00
2025-05-07T20:27:55.6075634Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch)
2025-05-07T20:27:55.6076431Z   Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:27:55.6077244Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:27:55.6078063Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:27:55.6079463Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1)
2025-05-07T20:27:55.6080315Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:27:55.6080871Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:27:55.6081510Z      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 56.1 MB/s eta 0:00:00
2025-05-07T20:27:55.6081884Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:27:55.6082668Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
2025-05-07T20:27:55.6083697Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp39-cp39-manylinux_2_28_x86_64.whl (825.5 MB)
2025-05-07T20:27:55.6084505Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.5/825.5 MB 36.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6085274Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
2025-05-07T20:27:55.6086128Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 14.1 MB/s eta 0:00:00
2025-05-07T20:27:55.6086876Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:27:55.6087719Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 101.7 MB/s eta 0:00:00
2025-05-07T20:27:55.6088514Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB)
2025-05-07T20:27:55.6089380Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 131.5 MB/s eta 0:00:00
2025-05-07T20:27:55.6091130Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:27:55.6094765Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:27:55.6096826Z 2025-05-07T20:27:57.8298594Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:27:57.8300944Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:01.2209390Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:04.6348408Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:04.6348880Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:07.9653180Z True 2025-05-07T20:28:07.9653411Z True 2025-05-07T20:28:07.9653510Z 2025-05-07T20:28:08.0266318Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:08.0302968Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:08.0303576Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:08.0317989Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:08.0318507Z env: 2025-05-07T20:28:08.0318732Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:08.0319026Z BUILD_ENV: build_binary 2025-05-07T20:28:08.0319271Z BUILD_TARGET: genai 2025-05-07T20:28:08.0319498Z BUILD_VARIANT: cuda 2025-05-07T20:28:08.0319738Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:08.0319984Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:08.0320285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:08.0320623Z ##[endgroup] 2025-05-07T20:28:08.3660200Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:08.3661736Z ################################################################################ 2025-05-07T20:28:08.3662215Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:08.3662577Z # 2025-05-07T20:28:08.3677266Z # [2025-05-07T20:28:08.367Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:08.3677667Z ################################################################################ 2025-05-07T20:28:08.3677897Z 2025-05-07T20:28:08.3693092Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:08.4599569Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:08.4609577Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:08.4610197Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:08.4610591Z 2025-05-07T20:28:08.5510538Z 2025-05-07T20:28:08.5511232Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:08.5534687Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:14.8470830Z Collecting environment information... 
2025-05-07T20:28:14.8471187Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:14.8471472Z Is debug build: False 2025-05-07T20:28:14.8471823Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:14.8472182Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:14.8472354Z 2025-05-07T20:28:14.8472468Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:14.8472776Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:14.8473090Z Clang version: Could not collect 2025-05-07T20:28:14.8473359Z CMake version: Could not collect 2025-05-07T20:28:14.8473639Z Libc version: glibc-2.34 2025-05-07T20:28:14.8473791Z 2025-05-07T20:28:14.8474090Z Python version: 3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:33:10) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:14.8474701Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:14.8475225Z Is CUDA available: True 2025-05-07T20:28:14.8475559Z CUDA runtime version: 12.6.85 2025-05-07T20:28:14.8475818Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:14.8476121Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:14.8476446Z Nvidia driver version: 570.133.07 2025-05-07T20:28:14.8476720Z cuDNN version: Could not collect 2025-05-07T20:28:14.8476981Z HIP runtime version: N/A 2025-05-07T20:28:14.8477229Z MIOpen runtime version: N/A 2025-05-07T20:28:14.8477485Z Is XNNPACK available: True 2025-05-07T20:28:14.8477650Z 2025-05-07T20:28:14.8477752Z CPU: 2025-05-07T20:28:14.8478046Z Architecture: x86_64 2025-05-07T20:28:14.8478519Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:14.8489783Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:14.8490341Z Byte Order: Little Endian 2025-05-07T20:28:14.8490704Z CPU(s): 16 2025-05-07T20:28:14.8490989Z On-line CPU(s) list: 0-15 2025-05-07T20:28:14.8491774Z Vendor ID: AuthenticAMD 2025-05-07T20:28:14.8492252Z Model name: AMD EPYC 7R32 2025-05-07T20:28:14.8492696Z CPU family: 23 2025-05-07T20:28:14.8493045Z Model: 49 2025-05-07T20:28:14.8493504Z Thread(s) per core: 2 2025-05-07T20:28:14.8493779Z Core(s) per socket: 8 2025-05-07T20:28:14.8494043Z Socket(s): 1 2025-05-07T20:28:14.8494304Z Stepping: 0 2025-05-07T20:28:14.8494587Z BogoMIPS: 5600.00 2025-05-07T20:28:14.8496647Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:14.8498691Z Hypervisor vendor: KVM 2025-05-07T20:28:14.8498996Z Virtualization type: full 2025-05-07T20:28:14.8499329Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:14.8499692Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:14.8500041Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:14.8500393Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:14.8500707Z NUMA node(s): 1 2025-05-07T20:28:14.8500989Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:14.8501319Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:14.8501697Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:14.8502051Z Vulnerability L1tf: Not affected 2025-05-07T20:28:14.8502387Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:14.8502734Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:14.8503097Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:14.8503447Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:14.8504302Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:14.8504882Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:14.8505411Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:14.8506084Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:14.8506935Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:14.8507603Z Vulnerability Srbds: Not affected 2025-05-07T20:28:14.8507951Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:14.8508182Z 2025-05-07T20:28:14.8508287Z Versions of relevant libraries: 2025-05-07T20:28:14.8508548Z [pip3] numpy==2.0.2 2025-05-07T20:28:14.8508789Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:14.8509084Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:14.8509387Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:14.8509697Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:14.8510057Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:14.8510347Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:14.8510636Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:14.8510928Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:14.8511231Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:14.8511738Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:14.8512029Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:14.8512331Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:14.8512654Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:14.8512935Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:14.8513360Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:14.8513732Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8514218Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8514716Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8515233Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8515760Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8516276Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:14.8516758Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8517221Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:14.8517693Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.8518176Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:14.8518650Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8519103Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.8519552Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8519990Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8520460Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8520938Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:14.8521384Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:14.8521845Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:14.8522308Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8522760Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:14.8523208Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8523669Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:14.8524136Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.8524600Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:14.8525079Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8525550Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:14.8526028Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:14.8526500Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:14.8526959Z [conda] numpy 2.0.2 py39h9cb892a_1 conda-forge 2025-05-07T20:28:14.8527409Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:14.8527891Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:14.8528380Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8528877Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8529362Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:14.8529925Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:14.8530397Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:14.8530878Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:14.8531447Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:14.8531935Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:14.8532414Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:14.8532944Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:14.8533411Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:14.8533883Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:14.8534342Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:14.8534606Z 2025-05-07T20:28:14.9161569Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.9162127Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:14.9176279Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:14.9176645Z env: 2025-05-07T20:28:14.9176870Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:14.9177173Z BUILD_ENV: build_binary 2025-05-07T20:28:14.9177407Z BUILD_TARGET: genai 2025-05-07T20:28:14.9177764Z BUILD_VARIANT: cuda 2025-05-07T20:28:14.9177999Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:14.9178255Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:14.9178549Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:14.9178880Z ##[endgroup] 2025-05-07T20:28:15.2573357Z ################################################################################ 2025-05-07T20:28:15.2573744Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:15.2573996Z # 2025-05-07T20:28:15.2588299Z # [2025-05-07T20:28:15.258Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:15.2588709Z ################################################################################ 2025-05-07T20:28:15.2588922Z 2025-05-07T20:28:15.2604412Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:15.3516140Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:15.3537123Z [BUILD] Running git submodules update ... 2025-05-07T20:28:15.3558248Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:15.3921971Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:15.3922436Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:15.3922876Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:15.3923256Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:15.3923655Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:15.3924094Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:15.3924499Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:15.3957503Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:15.4500656Z [BUILD] Installing other build dependencies ... 
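[Editor's note] The variant and ABI checks logged in the torch install step above ("correct variant (cu126)", "_GLIBCXX_USE_CXX11_ABI") reduce to a few standard torch calls. A minimal sketch follows; the expected values in the comments are the ones this job logged, and nothing here is FBGEMM-specific. The build-dependency installation output continues below.

    import torch

    # Variant check: the nightly build string should carry the cu126 tag.
    print(torch.__version__)                # 2.8.0.dev20250507+cu126
    assert "cu126" in torch.__version__
    # CUDA version PyTorch was built against, as reported by collect_env above.
    print(torch.version.cuda)               # 12.6
    # The _GLIBCXX_USE_CXX11_ABI probe that printed "True" above.
    print(torch.compiled_with_cxx11_abi())  # True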
2025-05-07T20:28:15.4522950Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:17.9176865Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:17.9295123Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:18.0390948Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:18.0426439Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:18.2963029Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:18.3000367Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:18.4145054Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:18.4178977Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:18.7861561Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:18.7894385Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:18.8521997Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:18.8527137Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:18.9405026Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:18.9438096Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:18.9981028Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 21)) (2.0.2) 2025-05-07T20:28:19.0583664Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:19.0615784Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:19.1876564Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:19.1937147Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:19.3217101Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:19.3268410Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:19.3763406Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:19.4465060Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:19.4497604Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:19.5503122Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:19.5555637Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:19.6737761Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:19.6770134Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:19.7929501Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:19.7960277Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:19.8973557Z Collecting pyproject_hooks (from build->-r requirements.txt (line 
14)) 2025-05-07T20:28:19.9003245Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:20.0505837Z Collecting importlib-metadata>=4.6 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.0538100Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T20:28:20.1744481Z Collecting tomli>=1.1.0 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.1779245Z Downloading tomli-2.2.1-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.2886034Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.2920439Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.4564756Z Collecting exceptiongroup>=1.0.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.4600615Z Downloading exceptiongroup-1.2.2-py3-none-any.whl.metadata (6.6 kB) 2025-05-07T20:28:20.5642543Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:20.5680054Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:20.6272040Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:20.6803205Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:20.6834793Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:20.7304705Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:20.7873218Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:20.7903945Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:20.8370526Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:20.9181921Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) 2025-05-07T20:28:20.9212330Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T20:28:21.0325363Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:21.0356285Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:21.0908777Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:21.1394832Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:21.1859897Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:21.6691715Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 57.7 MB/s eta 0:00:00 2025-05-07T20:28:21.6732894Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:21.7234875Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:21.7757517Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:21.8239889Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:21.8782853Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:21.9256636Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 
(737 kB) 2025-05-07T20:28:21.9840102Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 737.4/737.4 kB 8.5 MB/s eta 0:00:00 2025-05-07T20:28:21.9882408Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:22.0380882Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.0887265Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:22.1374118Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:22.1937325Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:22.2410758Z Downloading exceptiongroup-1.2.2-py3-none-any.whl (16 kB) 2025-05-07T20:28:22.2899336Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T20:28:22.3379625Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:22.3881579Z Downloading tomli-2.2.1-py3-none-any.whl (14 kB) 2025-05-07T20:28:22.4355721Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T20:28:22.4866813Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:22.5469761Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:22.5989247Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:22.6488967Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:22.8954871Z Installing collected packages: sortedcontainers, zipp, tomli, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, exceptiongroup, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, importlib-metadata, hypothesis, pyre-extensions, build 2025-05-07T20:28:25.3675867Z 2025-05-07T20:28:25.3753197Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 exceptiongroup-1.2.2 hypothesis-6.131.14 importlib-metadata-8.7.0 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 tomli-2.2.1 typing-inspect-0.9.0 zipp-3.21.0 2025-05-07T20:28:25.5582277Z ################################################################################ 2025-05-07T20:28:25.5582652Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:25.5582922Z # 2025-05-07T20:28:25.5599117Z # [2025-05-07T20:28:25.559Z] + install_triton_pip build_binary 2025-05-07T20:28:25.5599564Z ################################################################################ 2025-05-07T20:28:25.5599867Z 2025-05-07T20:28:25.5600121Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:25.5600914Z ################################################################################ 2025-05-07T20:28:25.5601283Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:25.5601607Z # 2025-05-07T20:28:25.5616561Z # [2025-05-07T20:28:25.561Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.5617220Z ################################################################################ 2025-05-07T20:28:25.5617442Z 2025-05-07T20:28:25.5632129Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:25.6527225Z [CHECK] Network does not appear to be blocked. 
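[Editor's note] The "[EXEC] [ATTEMPT 0/3]" prefix that recurs throughout this log, including the wget network probe just above, comes from a retry wrapper in .github/scripts/setup_env.bash. The following is a rough Python equivalent of that pattern for illustration only; the helper name and retry delay are assumptions, not the script's actual implementation.

    import subprocess, time

    def exec_with_retries(cmd, max_retries=3, delay_s=5):
        # Mirrors the "[EXEC] [ATTEMPT n/3]" log lines: run, retry on failure.
        for attempt in range(max_retries):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_retries}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(delay_s)
        raise RuntimeError(f"failed after {max_retries} attempts: {cmd}")

    # e.g. the network probe above:
    exec_with_retries(["wget", "-q", "--timeout", "1", "pypi.org", "-O", "/dev/null"])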
2025-05-07T20:28:25.6527654Z ################################################################################ 2025-05-07T20:28:25.6527984Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:25.6528260Z # 2025-05-07T20:28:25.6544842Z # [2025-05-07T20:28:25.654Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6545451Z ################################################################################ 2025-05-07T20:28:25.6545669Z 2025-05-07T20:28:25.6592389Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:25.6609100Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:25.6609595Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:25.6618714Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:25.6627877Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:25.6649037Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.9647424Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:32.9648760Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:32.9649489Z 2025-05-07T20:28:32.9649714Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.9650135Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.9650919Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:32.9652119Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.4 MB) 2025-05-07T20:28:32.9653194Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.4/166.4 MB 61.8 MB/s eta 0:00:00 2025-05-07T20:28:32.9653580Z Installing collected packages: pytorch-triton 2025-05-07T20:28:32.9653923Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:32.9654310Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:32.9654734Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:32.9655151Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:32.9655935Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:32.9656196Z 2025-05-07T20:28:35.1755769Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:35.1759099Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:37.3102508Z ################################################################################ 2025-05-07T20:28:37.3103101Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:37.3103469Z ################################################################################ 2025-05-07T20:28:37.3103968Z 2025-05-07T20:28:39.3401693Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:41.5147053Z [CHECK] Python (sub-)package 'skbuild' found ... 
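[Editor's note] The pip resolver warning above is expected here: the job deliberately pins pytorch-triton to nightly/3.2.0+git4b3bb1f8 while the installed torch nightly declares 3.3.0+git96316ce5. A hedged sketch for surfacing that mismatch yourself with only stdlib importlib.metadata:

    from importlib.metadata import requires, version

    installed = version("pytorch-triton")   # 3.2.0+git4b3bb1f8 after the downgrade
    declared = [r for r in (requires("torch") or [])
                if r.startswith("pytorch-triton")]
    print("installed      :", installed)
    print("torch declares :", declared)     # pytorch-triton==3.3.0+git96316ce5; ...
    if not any(installed in r for r in declared):
        print("[WARN] pytorch-triton does not match torch's declared pin")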
2025-05-07T20:28:41.5150430Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:41.5202833Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.5203307Z . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:41.5216747Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:41.5217096Z env: 2025-05-07T20:28:41.5217324Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:41.5217613Z BUILD_ENV: build_binary 2025-05-07T20:28:41.5217861Z BUILD_TARGET: genai 2025-05-07T20:28:41.5218090Z BUILD_VARIANT: cuda 2025-05-07T20:28:41.5218320Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:41.5218571Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:41.5218865Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:41.5219214Z ##[endgroup] 2025-05-07T20:28:41.8553391Z ################################################################################ 2025-05-07T20:28:41.8553899Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:41.8554250Z # 2025-05-07T20:28:41.8570001Z # [2025-05-07T20:28:41.856Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8570633Z ################################################################################ 2025-05-07T20:28:41.8570852Z 2025-05-07T20:28:41.8571204Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8571953Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8572281Z 2025-05-07T20:28:41.8691424Z d4ed0368510af43fe003d0e644f3e214a1184cea fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8693800Z 2025-05-07T20:28:41.8694387Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8694730Z 2025-05-07T20:28:41.8829441Z 9230f5ec3cd9c0291353aa93f1630c572cadce13d0b83330ed75b92574b61dfc fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.8831701Z 2025-05-07T20:28:41.8832207Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.9061622Z 2025-05-07T20:28:41.9062128Z 9e24861bd267fb8b82804cd45222d975 fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:41.9064355Z 2025-05-07T20:28:41.9074071Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:41.9095168Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.6290076Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:44.6291023Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.0.2) 2025-05-07T20:28:44.6291868Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:44.6292306Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:44.6292572Z 2025-05-07T20:28:51.4409012Z ################################################################################ 2025-05-07T20:28:51.4409738Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:51.4410465Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:51.4411066Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:28:51.4411368Z [CHECK]
2025-05-07T20:28:51.4411691Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:28:51.4412184Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:28:51.4412568Z ################################################################################
2025-05-07T20:28:51.4412786Z
2025-05-07T20:28:51.4412899Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:28:55.3058194Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:28:59.1830543Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:03.0725566Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:29:03.0729150Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:29:14.7303157Z ################################################################################
2025-05-07T20:29:14.7303573Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:29:14.7304047Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:29:14.7304380Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:29:14.7304719Z ################################################################################
2025-05-07T20:29:14.7304937Z
2025-05-07T20:29:22.4900624Z ################################################################################
2025-05-07T20:29:22.4901020Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:29:22.4902401Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:29:22.4904175Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:29:22.4904694Z ################################################################################
2025-05-07T20:29:22.4904911Z
2025-05-07T20:29:22.4905070Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:29:26.3713128Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:29:30.2458119Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:29:34.2591676Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:29:38.1647329Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:29:38.1651567Z [INSTALL] Check for operator registrations ...
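[Editor's note] Before the registration output below: the wheel fingerprinting performed earlier (sha1/sha256/md5 of the artifact before installing it) is a few lines of stdlib Python. A minimal sketch, for any local wheel file:

    import hashlib, pathlib

    whl = pathlib.Path(
        "fbgemm_gpu_genai_nightly-2025.5.7-cp39-cp39-manylinux_2_28_x86_64.whl"
    )
    data = whl.read_bytes()
    for algo in ("sha1", "sha256", "md5"):
        # Same digests the [INSTALL] step printed above.
        print(hashlib.new(algo, data).hexdigest(), whl.name)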
2025-05-07T20:29:41.9794718Z fbgemm.nccl_init 2025-05-07T20:29:41.9794909Z 2025-05-07T20:29:42.0433310Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:45.8619385Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:45.8619663Z 2025-05-07T20:29:45.9235509Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:49.7571237Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:49.7571449Z 2025-05-07T20:29:49.8185779Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:49.8186379Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:49.8221668Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:49.8222146Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:49.8236477Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:49.8236833Z env: 2025-05-07T20:29:49.8237239Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:49.8237548Z BUILD_ENV: build_binary 2025-05-07T20:29:49.8237796Z BUILD_TARGET: genai 2025-05-07T20:29:49.8238027Z BUILD_VARIANT: cuda 2025-05-07T20:29:49.8238261Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:49.8238523Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:49.8238829Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:49.8239157Z ##[endgroup] 2025-05-07T20:29:50.1597439Z ################################################################################ 2025-05-07T20:29:50.1597804Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:50.1598064Z # 2025-05-07T20:29:50.1612919Z # [2025-05-07T20:29:50.160Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:50.1613343Z ################################################################################ 2025-05-07T20:29:50.1613556Z 2025-05-07T20:29:57.9302948Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:29:57.9303881Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:29:57.9304385Z [TEST] Determined the test directories: 2025-05-07T20:29:57.9304693Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:29:57.9304997Z fbgemm_gpu/experimental/example/test 2025-05-07T20:29:57.9305295Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:29:57.9305481Z 2025-05-07T20:29:57.9312510Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:29:57.9320378Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:29:57.9320974Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:29:57.9321361Z 2025-05-07T20:29:58.3613271Z 2025-05-07T20:29:58.3613599Z [TEST] Installing PyTest ... 
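[Editor's note] A hedged sketch of the operator-registration probe that produced the [CHECK] lines above: importing fbgemm_gpu registers its custom ops under torch.ops.fbgemm, where they can then be looked up by name. The exact probe lives in the setup scripts; this is an equivalent check, not the script itself. The PyTest installation output continues below.

    import torch
    import fbgemm_gpu  # noqa: F401 -- importing the package registers the ops

    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        # torch.ops.fbgemm raises AttributeError for unknown op names,
        # so hasattr doubles as a registration check.
        assert hasattr(torch.ops.fbgemm, op_name), f"not registered: {op_name}"
        print(f"[CHECK] registered: torch.ops.fbgemm.{op_name}")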
2025-05-07T20:29:58.3636721Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:29:59.4685232Z Channels: 2025-05-07T20:29:59.4685538Z - conda-forge 2025-05-07T20:29:59.4685864Z Platform: linux-64 2025-05-07T20:30:02.6987950Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:03.8460730Z Solving environment: \ | / done 2025-05-07T20:30:04.0755002Z 2025-05-07T20:30:04.0755673Z ## Package Plan ## 2025-05-07T20:30:04.0755914Z 2025-05-07T20:30:04.0756196Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:04.0756598Z 2025-05-07T20:30:04.0756700Z added / updated specs: 2025-05-07T20:30:04.0756949Z - expecttest 2025-05-07T20:30:04.0757166Z - pytest 2025-05-07T20:30:04.0757286Z 2025-05-07T20:30:04.0757290Z 2025-05-07T20:30:04.0757413Z The following packages will be downloaded: 2025-05-07T20:30:04.0757660Z 2025-05-07T20:30:04.0757776Z package | build 2025-05-07T20:30:04.0758100Z ---------------------------|----------------- 2025-05-07T20:30:04.0758616Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:04.0759269Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:04.0759895Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:04.0760345Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:04.0760772Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:04.0761199Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:04.0761611Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:04.0762372Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:04.0762756Z ------------------------------------------------------------ 2025-05-07T20:30:04.0763096Z Total: 428 KB 2025-05-07T20:30:04.0763302Z 2025-05-07T20:30:04.0763438Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:04.0763817Z 2025-05-07T20:30:04.0764022Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:04.0764521Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:04.0765040Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:04.0765513Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:04.0765971Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:04.0766417Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:04.0766849Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:04.0767266Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:04.0767518Z 2025-05-07T20:30:04.0767522Z 2025-05-07T20:30:04.0767526Z 2025-05-07T20:30:04.0767674Z Downloading and Extracting Packages: ...working... 
(conda's interleaved download progress bars elided; all eight packages reached 100%) done
2025-05-07T20:30:04.4691283Z Preparing transaction: done
2025-05-07T20:30:04.5697176Z Verifying transaction: done
2025-05-07T20:30:06.4724621Z Executing transaction: done
2025-05-07T20:30:06.5961073Z [TEST] Checking imports ...
2025-05-07T20:30:10.4505678Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:10.4516905Z [TEST] Setting feature flags ...
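[Editor's note] Both the CUDA_VISIBLE_DEVICES unset earlier and the feature-flag set whose command follows below go through `conda env config vars`, which pins environment variables to the conda environment itself. A small sketch driving the same CLI from Python; the helper function is illustrative, not part of the setup scripts.

    import subprocess

    def conda_env_vars(env_name, action, *args):
        # Wraps `conda env config vars {set,unset} -n <env> ...`.
        cmd = ["conda", "env", "config", "vars", action, "-n", env_name, *args]
        subprocess.run(cmd, check=True)

    conda_env_vars("build_binary", "unset", "CUDA_VISIBLE_DEVICES")
    conda_env_vars("build_binary", "set", "FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1")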
2025-05-07T20:30:10.4517354Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:10.4517692Z 2025-05-07T20:30:10.8794208Z 2025-05-07T20:30:10.8794752Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:10.8795211Z ################################################################################ 2025-05-07T20:30:10.8795519Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:10.8795760Z # 2025-05-07T20:30:10.8813031Z # [2025-05-07T20:30:10.880Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:10.8813443Z ################################################################################ 2025-05-07T20:30:10.8813666Z 2025-05-07T20:30:10.8820341Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:10.8849209Z ./attention/gqa_test.py 2025-05-07T20:30:10.8849504Z ./coalesce/coalesce_test.py 2025-05-07T20:30:10.8849758Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:10.8850037Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:10.8850334Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:10.8850577Z ./moe/activation_test.py 2025-05-07T20:30:10.8850827Z ./moe/gather_scatter_test.py 2025-05-07T20:30:10.8851078Z ./moe/layers_test.py 2025-05-07T20:30:10.8851310Z ./moe/shuffling_test.py 2025-05-07T20:30:10.8851543Z ./quantize/quantize_test.py 2025-05-07T20:30:10.8851709Z 2025-05-07T20:30:10.8851823Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:10.8852032Z 2025-05-07T20:30:10.8869955Z ################################################################################ 2025-05-07T20:30:10.8885193Z # [2025-05-07T20:30:10.888Z] Run Python Test Suite: 2025-05-07T20:30:10.8885512Z # ./attention/gqa_test.py 2025-05-07T20:30:10.8885795Z ################################################################################ 2025-05-07T20:30:10.8910344Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:10.8910953Z 2025-05-07T20:30:13.4277109Z ============================= test session starts ============================== 2025-05-07T20:30:13.4277966Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:13.4278500Z cachedir: .pytest_cache 2025-05-07T20:30:13.4279083Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:13.4280075Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:13.4280510Z plugins: hypothesis-6.131.14 2025-05-07T20:30:14.9426489Z collecting ... 
collected 2 items
2025-05-07T20:30:51.7868324Z attention/gqa_test.py::Int4GQATest::test_gqa
(Hypothesis "Trying example: test_gqa(...)" output condensed; the self=<Int4GQATest ...> test-case repr was stripped during log capture. Examples tried, as (int4_kv, num_groups, B, MAX_T, N_H_L):)
  (False, 1,   1,   4,   1)   (True,  1,   1,   4,   1)   (True,  4,  23,  33,  68)
  (True,  4,  77,   4,   1)   (True,  4,  77,  52,  67)   (False, 4,  57,  45, 120)
  (True,  4,  52,  42,  53)   (True,  1,  77,  95,  53)   (True,  4, 113,  48,  96)
  (False, 1,  51,  61,  69)   (False, 4,  17, 113,  65)   (False, 4,  17,  65,  65)
  (False, 4,  65,  65,  65)   (False, 1,   6, 108,  14)   (False, 1,   6,  14,  14)
  (False, 1,   6,   6,  14)   (False, 1,   6,   6,   6)   (False, 1,  70,  94,  78)
  (False, 1,  78,  94,  78)   (False, 1,  94,  94,  78)   (False, 1,  94,  94,  94)
  (False, 4,  41, 105, 126)   (False, 4, 105, 105, 126)   (False, 4, 105, 105, 105)
  (True,  1,  95, 114,  43)   (True,  1,  43, 114,  43)   (True,  1,  43,  43,  43)
  (False, 1,  21,  38,  42)   (False, 1,  38,  38,  42)   (False, 1,  38,  42,  42)
  (False, 1,  42,  42,  42)   (True,  1,  74,  20,  15)   (True,  1,  20,  20, [log truncated here])
2025-05-07T20:30:51.7936300Z N_H_L=15, 2025-05-07T20:30:51.7936485Z ) 2025-05-07T20:30:51.7936666Z Trying example: test_gqa( 2025-05-07T20:30:51.7936954Z self=, 2025-05-07T20:30:51.7937261Z int4_kv=True, 2025-05-07T20:30:51.7937455Z num_groups=1, 2025-05-07T20:30:51.7937656Z B=20, 2025-05-07T20:30:51.7937848Z MAX_T=15, 2025-05-07T20:30:51.7938032Z N_H_L=15, 2025-05-07T20:30:51.7938214Z ) 2025-05-07T20:30:51.7938407Z Trying example: test_gqa( 2025-05-07T20:30:51.7938687Z self=, 2025-05-07T20:30:51.7938997Z int4_kv=True, 2025-05-07T20:30:51.7939200Z num_groups=1, 2025-05-07T20:30:51.7939408Z B=15, 2025-05-07T20:30:51.7939589Z MAX_T=20, 2025-05-07T20:30:51.7939779Z N_H_L=15, 2025-05-07T20:30:51.7939962Z ) 2025-05-07T20:30:51.7940145Z Trying example: test_gqa( 2025-05-07T20:30:51.7940427Z self=, 2025-05-07T20:30:51.7940758Z int4_kv=True, 2025-05-07T20:30:51.7940954Z num_groups=1, 2025-05-07T20:30:51.7941151Z B=15, 2025-05-07T20:30:51.7941337Z MAX_T=15, 2025-05-07T20:30:51.7941520Z N_H_L=15, 2025-05-07T20:30:51.7941703Z ) 2025-05-07T20:30:51.7941892Z Trying example: test_gqa( 2025-05-07T20:30:51.7942173Z self=, 2025-05-07T20:30:51.7942482Z int4_kv=False, 2025-05-07T20:30:51.7942691Z num_groups=4, 2025-05-07T20:30:51.7942893Z B=117, 2025-05-07T20:30:51.7943073Z MAX_T=104, 2025-05-07T20:30:51.7943263Z N_H_L=69, 2025-05-07T20:30:51.7943449Z ) 2025-05-07T20:30:51.7943640Z Trying example: test_gqa( 2025-05-07T20:30:51.7943933Z self=, 2025-05-07T20:30:51.7944243Z int4_kv=False, 2025-05-07T20:30:51.7944447Z num_groups=4, 2025-05-07T20:30:51.7944656Z B=117, 2025-05-07T20:30:51.7944841Z MAX_T=117, 2025-05-07T20:30:51.7945030Z N_H_L=69, 2025-05-07T20:30:51.7945214Z ) 2025-05-07T20:30:51.7945398Z Trying example: test_gqa( 2025-05-07T20:30:51.7945680Z self=, 2025-05-07T20:30:51.7945985Z int4_kv=False, 2025-05-07T20:30:51.7946193Z num_groups=4, 2025-05-07T20:30:51.7946383Z B=69, 2025-05-07T20:30:51.7946572Z MAX_T=117, 2025-05-07T20:30:51.7946767Z N_H_L=69, 2025-05-07T20:30:51.7946946Z ) 2025-05-07T20:30:51.7947137Z Trying example: test_gqa( 2025-05-07T20:30:51.7947416Z self=, 2025-05-07T20:30:51.7947744Z int4_kv=False, 2025-05-07T20:30:51.7947943Z num_groups=4, 2025-05-07T20:30:51.7948147Z B=117, 2025-05-07T20:30:51.7948331Z MAX_T=69, 2025-05-07T20:30:51.7948521Z N_H_L=69, 2025-05-07T20:30:51.7948713Z ) 2025-05-07T20:30:51.7948898Z PASSED 2025-05-07T20:30:51.8307803Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
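[Editor's note] The wall of "Trying example: test_gqa(...)" lines above is Hypothesis's verbose mode echoing each drawn parameter set under the 'ci' profile printed at session start (derandomize=True, deadline=None, print_blob=True). The decorator behind test_gqa is not captured in this log, so the sketch below only reconstructs the pattern: the strategies are guesses inferred from the printed examples (boolean int4_kv, num_groups drawn from {1, 4}, small integer B, MAX_T, and N_H_L), mirroring the @given/@settings combination that the log does show verbatim later for test_silu_mul_quant.

    # Hypothetical reconstruction of the parametrization producing the
    # "Trying example" lines; NOT the actual decorator from
    # attention/gqa_test.py, whose source this log does not capture.
    import unittest

    from hypothesis import Verbosity, given, settings
    import hypothesis.strategies as st


    class Int4GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=4, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L):
            # Verbosity.verbose makes Hypothesis print each drawn example as
            # "Trying example: test_gqa(...)"; the real body exercises the
            # grouped-query attention kernels and is omitted here.
            self.assertGreater(B * MAX_T * N_H_L, 0)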
2025-05-07T20:30:51.8308138Z 2025-05-07T20:30:51.8308287Z =========================== short test summary info ============================ 2025-05-07T20:30:51.8308992Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:51.8309685Z ======================== 1 passed, 1 skipped in 38.92s ========================= 2025-05-07T20:30:52.4497665Z 2025-05-07T20:30:52.4498427Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:52.4517651Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:30:52.4517983Z 2025-05-07T20:30:52.4517987Z 2025-05-07T20:30:52.4517991Z 2025-05-07T20:30:52.4517995Z 2025-05-07T20:30:52.4537858Z ################################################################################ 2025-05-07T20:30:52.4553351Z # [2025-05-07T20:30:52.455Z] Run Python Test Suite: 2025-05-07T20:30:52.4553701Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:52.4553995Z ################################################################################ 2025-05-07T20:30:52.4579315Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:52.4579937Z 2025-05-07T20:30:54.6047434Z ============================= test session starts ============================== 2025-05-07T20:30:54.6048529Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:54.6049413Z cachedir: .pytest_cache 2025-05-07T20:30:54.6050382Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:54.6051716Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:54.6052378Z plugins: hypothesis-6.131.14 2025-05-07T20:30:56.1455465Z collecting ... 
collected 1 item 2025-05-07T20:30:56.1455700Z 2025-05-07T20:30:56.8724776Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:30:56.8725287Z 2025-05-07T20:30:56.8725504Z ============================== 1 passed in 2.40s =============================== 2025-05-07T20:30:57.4714656Z 2025-05-07T20:30:57.4715360Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:30:57.4735465Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:30:57.4735906Z 2025-05-07T20:30:57.4735924Z 2025-05-07T20:30:57.4735930Z 2025-05-07T20:30:57.4735935Z 2025-05-07T20:30:57.4757626Z ################################################################################ 2025-05-07T20:30:57.4773020Z # [2025-05-07T20:30:57.476Z] Run Python Test Suite: 2025-05-07T20:30:57.4773502Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.4773921Z ################################################################################ 2025-05-07T20:30:57.4797076Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:30:57.4797826Z 2025-05-07T20:30:59.6352272Z ============================= test session starts ============================== 2025-05-07T20:30:59.6353136Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:59.6353668Z cachedir: .pytest_cache 2025-05-07T20:30:59.6354271Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:59.6355003Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:59.6355418Z plugins: hypothesis-6.131.14 2025-05-07T20:31:01.2198618Z collecting ... 
collected 5 items 2025-05-07T20:31:01.2198972Z 2025-05-07T20:31:01.2209823Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:01.2218274Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:01.2226151Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:01.2234036Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:01.2251101Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:01.2251569Z 2025-05-07T20:31:01.2252144Z =========================== short test summary info ============================ 2025-05-07T20:31:01.2252824Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2253744Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2254804Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2255728Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2256643Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:01.2257284Z ============================== 5 skipped in 1.72s ============================== 2025-05-07T20:31:01.7385554Z 2025-05-07T20:31:01.7386238Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:01.7405522Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:01.7405935Z 2025-05-07T20:31:01.7405941Z 2025-05-07T20:31:01.7405947Z 2025-05-07T20:31:01.7405972Z 2025-05-07T20:31:01.7428015Z ################################################################################ 2025-05-07T20:31:01.7443638Z # [2025-05-07T20:31:01.744Z] Run Python Test Suite: 2025-05-07T20:31:01.7444102Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:01.7444508Z ################################################################################ 2025-05-07T20:31:01.7468458Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:01.7469272Z 2025-05-07T20:31:03.9219433Z ============================= test session starts ============================== 2025-05-07T20:31:03.9220066Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:03.9220587Z cachedir: .pytest_cache 2025-05-07T20:31:03.9221152Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:03.9221880Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:03.9222294Z plugins: hypothesis-6.131.14 2025-05-07T20:31:05.5765890Z collecting ... 
collected 2 items 2025-05-07T20:31:05.5766288Z 2025-05-07T20:31:05.5777379Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:05.5792329Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:05.5792890Z 2025-05-07T20:31:05.5793084Z =========================== short test summary info ============================ 2025-05-07T20:31:05.5793716Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.5794551Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:05.5795161Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:31:06.1092211Z 2025-05-07T20:31:06.1093141Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:06.1112726Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:06.1113248Z 2025-05-07T20:31:06.1113254Z 2025-05-07T20:31:06.1113260Z 2025-05-07T20:31:06.1113265Z 2025-05-07T20:31:06.1135306Z ################################################################################ 2025-05-07T20:31:06.1150909Z # [2025-05-07T20:31:06.114Z] Run Python Test Suite: 2025-05-07T20:31:06.1151741Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.1152128Z ################################################################################ 2025-05-07T20:31:06.1175883Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:06.1176821Z 2025-05-07T20:31:08.2610537Z ============================= test session starts ============================== 2025-05-07T20:31:08.2611192Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:08.2611719Z cachedir: .pytest_cache 2025-05-07T20:31:08.2612304Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:08.2613021Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:08.2613481Z plugins: hypothesis-6.131.14 2025-05-07T20:31:09.8171686Z collecting ... collected 4 items 2025-05-07T20:31:09.8171998Z 2025-05-07T20:31:12.8190840Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:12.8355761Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:12.8552547Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:12.8716913Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:12.8717370Z 2025-05-07T20:31:12.8717529Z =========================== short test summary info ============================ 2025-05-07T20:31:12.8718230Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:12.8719135Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/unittest/case.py:117: Skip when xformers is not available 2025-05-07T20:31:12.8719759Z ============================== 4 skipped in 4.74s ============================== 2025-05-07T20:31:14.4971109Z 2025-05-07T20:31:14.4971639Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:14.4988521Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:14.4988816Z 2025-05-07T20:31:14.4988878Z 2025-05-07T20:31:14.4989037Z 2025-05-07T20:31:14.4989048Z 2025-05-07T20:31:14.5011315Z ################################################################################ 2025-05-07T20:31:14.5028016Z # [2025-05-07T20:31:14.502Z] Run Python Test Suite: 2025-05-07T20:31:14.5028358Z # ./moe/activation_test.py 2025-05-07T20:31:14.5028643Z ################################################################################ 2025-05-07T20:31:14.5054506Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:14.5055122Z 2025-05-07T20:31:16.6631304Z ============================= test session starts ============================== 2025-05-07T20:31:16.6631932Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:16.6632450Z cachedir: .pytest_cache 2025-05-07T20:31:16.6633014Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:16.6633748Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:16.6634155Z plugins: hypothesis-6.131.14 2025-05-07T20:31:18.3047011Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:18.5181567Z collecting ... 
collected 2 items 2025-05-07T20:31:18.5181792Z 2025-05-07T20:31:24.4462540Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:24.4464179Z self=, 2025-05-07T20:31:24.4465480Z T=1, 2025-05-07T20:31:24.4465930Z D=5120, 2025-05-07T20:31:24.4466474Z contiguous=True, 2025-05-07T20:31:24.4467003Z compiled=True, 2025-05-07T20:31:24.4467323Z ) 2025-05-07T20:31:24.4467587Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4468042Z self=, 2025-05-07T20:31:24.4468626Z T=4096, 2025-05-07T20:31:24.4468824Z D=5120, 2025-05-07T20:31:24.4469012Z contiguous=True, 2025-05-07T20:31:24.4469224Z compiled=True, 2025-05-07T20:31:24.4469428Z ) 2025-05-07T20:31:24.4469623Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4470107Z self=, 2025-05-07T20:31:24.4470486Z T=4096, 2025-05-07T20:31:24.4470670Z D=7168, 2025-05-07T20:31:24.4470860Z contiguous=False, 2025-05-07T20:31:24.4471086Z compiled=False, 2025-05-07T20:31:24.4471292Z ) 2025-05-07T20:31:24.4471480Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4471859Z self=, 2025-05-07T20:31:24.4472236Z T=4096, 2025-05-07T20:31:24.4472420Z D=5120, 2025-05-07T20:31:24.4472612Z contiguous=False, 2025-05-07T20:31:24.4472835Z compiled=True, 2025-05-07T20:31:24.4473036Z ) 2025-05-07T20:31:24.4473226Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4473606Z self=, 2025-05-07T20:31:24.4473985Z T=1, 2025-05-07T20:31:24.4474161Z D=7168, 2025-05-07T20:31:24.4474357Z contiguous=True, 2025-05-07T20:31:24.4474574Z compiled=True, 2025-05-07T20:31:24.4474767Z ) 2025-05-07T20:31:24.4474960Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4475327Z self=, 2025-05-07T20:31:24.4475690Z T=1, 2025-05-07T20:31:24.4475870Z D=7168, 2025-05-07T20:31:24.4476065Z contiguous=False, 2025-05-07T20:31:24.4476277Z compiled=True, 2025-05-07T20:31:24.4476490Z ) 2025-05-07T20:31:24.4476684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4477042Z self=, 2025-05-07T20:31:24.4477417Z T=4096, 2025-05-07T20:31:24.4477604Z D=5120, 2025-05-07T20:31:24.4477791Z contiguous=False, 2025-05-07T20:31:24.4478022Z compiled=False, 2025-05-07T20:31:24.4478224Z ) 2025-05-07T20:31:24.4478421Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4478780Z self=, 2025-05-07T20:31:24.4479152Z T=1, 2025-05-07T20:31:24.4479328Z D=7168, 2025-05-07T20:31:24.4479521Z contiguous=True, 2025-05-07T20:31:24.4479743Z compiled=False, 2025-05-07T20:31:24.4479951Z ) 2025-05-07T20:31:24.4480138Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4480505Z self=, 2025-05-07T20:31:24.4480873Z T=2048, 2025-05-07T20:31:24.4481050Z D=5120, 2025-05-07T20:31:24.4481255Z contiguous=True, 2025-05-07T20:31:24.4481475Z compiled=True, 2025-05-07T20:31:24.4481670Z ) 2025-05-07T20:31:24.4481875Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4482242Z self=, 2025-05-07T20:31:24.4482610Z T=2048, 2025-05-07T20:31:24.4482803Z D=7168, 2025-05-07T20:31:24.4483004Z contiguous=True, 2025-05-07T20:31:24.4483217Z compiled=True, 2025-05-07T20:31:24.4483420Z ) 2025-05-07T20:31:24.4483616Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4483974Z self=, 2025-05-07T20:31:24.4484350Z T=2048, 2025-05-07T20:31:24.4484540Z D=7168, 2025-05-07T20:31:24.4484733Z contiguous=True, 2025-05-07T20:31:24.4484948Z compiled=False, 2025-05-07T20:31:24.4485156Z ) 2025-05-07T20:31:24.4485352Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4485715Z self=, 2025-05-07T20:31:24.4486189Z T=128, 2025-05-07T20:31:24.4486374Z D=5120, 2025-05-07T20:31:24.4486561Z contiguous=False, 2025-05-07T20:31:24.4486783Z 
compiled=True, 2025-05-07T20:31:24.4486985Z ) 2025-05-07T20:31:24.4487170Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4487533Z self=, 2025-05-07T20:31:24.4487981Z T=128, 2025-05-07T20:31:24.4488155Z D=5120, 2025-05-07T20:31:24.4488347Z contiguous=True, 2025-05-07T20:31:24.4488566Z compiled=True, 2025-05-07T20:31:24.4488760Z ) 2025-05-07T20:31:24.4488953Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4489316Z self=, 2025-05-07T20:31:24.4489683Z T=16384, 2025-05-07T20:31:24.4489877Z D=5120, 2025-05-07T20:31:24.4490071Z contiguous=False, 2025-05-07T20:31:24.4490287Z compiled=True, 2025-05-07T20:31:24.4490488Z ) 2025-05-07T20:31:24.4490679Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4491051Z self=, 2025-05-07T20:31:24.4491416Z T=16384, 2025-05-07T20:31:24.4491609Z D=5120, 2025-05-07T20:31:24.4491800Z contiguous=False, 2025-05-07T20:31:24.4492018Z compiled=False, 2025-05-07T20:31:24.4492234Z ) 2025-05-07T20:31:24.4492434Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4492802Z self=, 2025-05-07T20:31:24.4493183Z T=128, 2025-05-07T20:31:24.4493376Z D=7168, 2025-05-07T20:31:24.4493562Z contiguous=True, 2025-05-07T20:31:24.4493794Z compiled=False, 2025-05-07T20:31:24.4493996Z ) 2025-05-07T20:31:24.4494187Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4494553Z self=, 2025-05-07T20:31:24.4494926Z T=128, 2025-05-07T20:31:24.4495106Z D=7168, 2025-05-07T20:31:24.4495303Z contiguous=False, 2025-05-07T20:31:24.4495530Z compiled=False, 2025-05-07T20:31:24.4495729Z ) 2025-05-07T20:31:24.4495924Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4496295Z self=, 2025-05-07T20:31:24.4496658Z T=1, 2025-05-07T20:31:24.4496841Z D=5120, 2025-05-07T20:31:24.4497038Z contiguous=False, 2025-05-07T20:31:24.4497266Z compiled=False, 2025-05-07T20:31:24.4497471Z ) 2025-05-07T20:31:24.4497660Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4498023Z self=, 2025-05-07T20:31:24.4498387Z T=1, 2025-05-07T20:31:24.4498572Z D=7168, 2025-05-07T20:31:24.4498760Z contiguous=False, 2025-05-07T20:31:24.4498974Z compiled=False, 2025-05-07T20:31:24.4499174Z ) 2025-05-07T20:31:24.4499364Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4499724Z self=, 2025-05-07T20:31:24.4500097Z T=4096, 2025-05-07T20:31:24.4500283Z D=5120, 2025-05-07T20:31:24.4500477Z contiguous=True, 2025-05-07T20:31:24.4500696Z compiled=False, 2025-05-07T20:31:24.4500901Z ) 2025-05-07T20:31:24.4501088Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4501456Z self=, 2025-05-07T20:31:24.4501829Z T=128, 2025-05-07T20:31:24.4502012Z D=7168, 2025-05-07T20:31:24.4502208Z contiguous=True, 2025-05-07T20:31:24.4502429Z compiled=True, 2025-05-07T20:31:24.4502625Z ) 2025-05-07T20:31:24.4502820Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4503184Z self=, 2025-05-07T20:31:24.4503560Z T=1, 2025-05-07T20:31:24.4504075Z D=5120, 2025-05-07T20:31:24.4504295Z contiguous=False, 2025-05-07T20:31:24.4504517Z compiled=True, 2025-05-07T20:31:24.4504710Z ) 2025-05-07T20:31:24.4504902Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4505274Z self=, 2025-05-07T20:31:24.4505789Z T=4096, 2025-05-07T20:31:24.4505984Z D=7168, 2025-05-07T20:31:24.4506173Z contiguous=True, 2025-05-07T20:31:24.4506386Z compiled=False, 2025-05-07T20:31:24.4506591Z ) 2025-05-07T20:31:24.4506786Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4507151Z self=, 2025-05-07T20:31:24.4507641Z T=4096, 2025-05-07T20:31:24.4507828Z D=7168, 2025-05-07T20:31:24.4508023Z contiguous=False, 2025-05-07T20:31:24.4508249Z compiled=True, 2025-05-07T20:31:24.4508454Z ) 
2025-05-07T20:31:24.4508648Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4509020Z self=, 2025-05-07T20:31:24.4509397Z T=128, 2025-05-07T20:31:24.4509582Z D=5120, 2025-05-07T20:31:24.4509771Z contiguous=True, 2025-05-07T20:31:24.4510072Z compiled=False, 2025-05-07T20:31:24.4510280Z ) 2025-05-07T20:31:24.4510471Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4510846Z self=, 2025-05-07T20:31:24.4511218Z T=128, 2025-05-07T20:31:24.4511397Z D=5120, 2025-05-07T20:31:24.4511588Z contiguous=False, 2025-05-07T20:31:24.4511815Z compiled=False, 2025-05-07T20:31:24.4512011Z ) 2025-05-07T20:31:24.4512213Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4512582Z self=, 2025-05-07T20:31:24.4512950Z T=1, 2025-05-07T20:31:24.4513137Z D=5120, 2025-05-07T20:31:24.4513329Z contiguous=True, 2025-05-07T20:31:24.4513547Z compiled=False, 2025-05-07T20:31:24.4513753Z ) 2025-05-07T20:31:24.4513951Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4514314Z self=, 2025-05-07T20:31:24.4514686Z T=2048, 2025-05-07T20:31:24.4514872Z D=7168, 2025-05-07T20:31:24.4515060Z contiguous=False, 2025-05-07T20:31:24.4515284Z compiled=True, 2025-05-07T20:31:24.4515490Z ) 2025-05-07T20:31:24.4515682Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4516049Z self=, 2025-05-07T20:31:24.4516423Z T=2048, 2025-05-07T20:31:24.4516610Z D=7168, 2025-05-07T20:31:24.4516796Z contiguous=False, 2025-05-07T20:31:24.4517029Z compiled=False, 2025-05-07T20:31:24.4517237Z ) 2025-05-07T20:31:24.4517428Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4517798Z self=, 2025-05-07T20:31:24.4518179Z T=16384, 2025-05-07T20:31:24.4518367Z D=7168, 2025-05-07T20:31:24.4518560Z contiguous=False, 2025-05-07T20:31:24.4518784Z compiled=True, 2025-05-07T20:31:24.4518985Z ) 2025-05-07T20:31:24.4519186Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4519552Z self=, 2025-05-07T20:31:24.4519924Z T=16384, 2025-05-07T20:31:24.4520117Z D=7168, 2025-05-07T20:31:24.4520317Z contiguous=True, 2025-05-07T20:31:24.4520528Z compiled=True, 2025-05-07T20:31:24.4520731Z ) 2025-05-07T20:31:24.4520935Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4521298Z self=, 2025-05-07T20:31:24.4521679Z T=4096, 2025-05-07T20:31:24.4521871Z D=7168, 2025-05-07T20:31:24.4522069Z contiguous=True, 2025-05-07T20:31:24.4522279Z compiled=True, 2025-05-07T20:31:24.4522484Z ) 2025-05-07T20:31:24.4522684Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4523049Z self=, 2025-05-07T20:31:24.4523434Z T=2048, 2025-05-07T20:31:24.4523618Z D=5120, 2025-05-07T20:31:24.4523805Z contiguous=False, 2025-05-07T20:31:24.4524034Z compiled=False, 2025-05-07T20:31:24.4524240Z ) 2025-05-07T20:31:24.4524440Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4524951Z self=, 2025-05-07T20:31:24.4525340Z T=2048, 2025-05-07T20:31:24.4525531Z D=5120, 2025-05-07T20:31:24.4525727Z contiguous=True, 2025-05-07T20:31:24.4525958Z compiled=False, 2025-05-07T20:31:24.4526168Z ) 2025-05-07T20:31:24.4526362Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4526742Z self=, 2025-05-07T20:31:24.4527191Z T=128, 2025-05-07T20:31:24.4527376Z D=7168, 2025-05-07T20:31:24.4527581Z contiguous=False, 2025-05-07T20:31:24.4527814Z compiled=True, 2025-05-07T20:31:24.4528023Z ) 2025-05-07T20:31:24.4528224Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4528609Z self=, 2025-05-07T20:31:24.4537641Z T=16384, 2025-05-07T20:31:24.4537951Z D=5120, 2025-05-07T20:31:24.4538233Z contiguous=True, 2025-05-07T20:31:24.4538484Z compiled=True, 2025-05-07T20:31:24.4538690Z ) 2025-05-07T20:31:24.4538897Z Trying example: 
test_silu_mul( 2025-05-07T20:31:24.4539291Z self=, 2025-05-07T20:31:24.4539677Z T=2048, 2025-05-07T20:31:24.4539877Z D=5120, 2025-05-07T20:31:24.4540088Z contiguous=False, 2025-05-07T20:31:24.4540329Z compiled=True, 2025-05-07T20:31:24.4540538Z ) 2025-05-07T20:31:24.4540749Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4541126Z self=, 2025-05-07T20:31:24.4541501Z T=16384, 2025-05-07T20:31:24.4541700Z D=5120, 2025-05-07T20:31:24.4541901Z contiguous=True, 2025-05-07T20:31:24.4542120Z compiled=False, 2025-05-07T20:31:24.4542332Z ) 2025-05-07T20:31:24.4542537Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4542907Z self=, 2025-05-07T20:31:24.4543286Z T=16384, 2025-05-07T20:31:24.4543479Z D=7168, 2025-05-07T20:31:24.4543675Z contiguous=False, 2025-05-07T20:31:24.4543910Z compiled=False, 2025-05-07T20:31:24.4544119Z ) 2025-05-07T20:31:24.4544310Z Trying example: test_silu_mul( 2025-05-07T20:31:24.4544685Z self=, 2025-05-07T20:31:24.4545064Z T=16384, 2025-05-07T20:31:24.4545254Z D=7168, 2025-05-07T20:31:24.4545452Z contiguous=True, 2025-05-07T20:31:24.4545683Z compiled=False, 2025-05-07T20:31:24.4545889Z ) 2025-05-07T20:31:24.4546071Z PASSED 2025-05-07T20:31:24.5151099Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.5152321Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.5153707Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.5155178Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.5156578Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.5157970Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.5159647Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.5161052Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.5162482Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.5163893Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.5165124Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.5166353Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.5167398Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:24.5168435Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.5169664Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.5170955Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.5172086Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:24.5173142Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.5174326Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.5175696Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.5176769Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.5177696Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.5178457Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.5179478Z W0507 20:31:24.513709 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[identical identify_mutated_tensors warning and traceback repeated at 20:31:24.531869 and 20:31:24.574000]
2025-05-07T20:31:24.5788554Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:24.5789772Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:24.5791173Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:24.5792599Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:24.5793987Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:24.5795362Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:24.5796672Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:24.5798060Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:24.5799478Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:24.5800735Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:24.5801960Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:24.5803179Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:24.5805212Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:24.5806283Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:24.5807503Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:24.5808909Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:24.5810028Z W0507
20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:24.5811076Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:24.5812252Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:24.5813622Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:24.5814674Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:24.5815589Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:24.5816336Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:24.5817362Z W0507 20:31:24.578363 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:25.0843750Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:25.0844685Z self=, 2025-05-07T20:31:25.0845117Z T=1, 2025-05-07T20:31:25.0845309Z D=5120, 2025-05-07T20:31:25.0845503Z scale_ub=None, 2025-05-07T20:31:25.0845723Z contiguous=True, 2025-05-07T20:31:25.0845949Z compiled=True, 2025-05-07T20:31:25.0846204Z ) 2025-05-07T20:31:25.0846575Z self = 2025-05-07T20:31:25.0847256Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:25.0847520Z 2025-05-07T20:31:25.0847603Z @given( 2025-05-07T20:31:25.0847838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:25.0848173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:25.0848595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:25.0849036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:25.0849401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:25.0849816Z ) 2025-05-07T20:31:25.0850291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:25.0850887Z def test_silu_mul_quant( 2025-05-07T20:31:25.0851164Z self, 2025-05-07T20:31:25.0851351Z T: int, 2025-05-07T20:31:25.0851547Z D: int, 2025-05-07T20:31:25.0851761Z scale_ub: Optional[float], 2025-05-07T20:31:25.0852028Z contiguous: bool, 2025-05-07T20:31:25.0852264Z compiled: bool, 2025-05-07T20:31:25.0852491Z ) -> None: 2025-05-07T20:31:25.0852706Z torch.manual_seed(2025) 2025-05-07T20:31:25.0852939Z 2025-05-07T20:31:25.0853524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:25.0853874Z 2025-05-07T20:31:25.0854063Z x_sign = torch.sign(x) 2025-05-07T20:31:25.0854352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:25.0854662Z x = x_sign * x_clamp 2025-05-07T20:31:25.0854897Z x0 = x[:, :D] 2025-05-07T20:31:25.0855253Z x1 = x[:, D:] 2025-05-07T20:31:25.0855459Z 2025-05-07T20:31:25.0855640Z if contiguous: 2025-05-07T20:31:25.0855876Z x0 = x0.contiguous() 
2025-05-07T20:31:25.0856142Z x1 = x1.contiguous() 2025-05-07T20:31:25.0856375Z 2025-05-07T20:31:25.0856570Z if scale_ub is not None: 2025-05-07T20:31:25.0856846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:25.0857178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:25.0857489Z ) 2025-05-07T20:31:25.0857685Z else: 2025-05-07T20:31:25.0857902Z scale_ub_tensor = None 2025-05-07T20:31:25.0858154Z 2025-05-07T20:31:25.0858387Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:25.0858708Z op = silu_mul_quant 2025-05-07T20:31:25.0858953Z if compiled: 2025-05-07T20:31:25.0859207Z op = torch.compile(op) 2025-05-07T20:31:25.0859516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:25.0859787Z 2025-05-07T20:31:25.0859982Z y_fp8, y_scale = fn() 2025-05-07T20:31:25.0860273Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:25.0860564Z 2025-05-07T20:31:25.0860807Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:25.0861147Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:25.0861440Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:25.0861767Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:25.0862123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:25.0862429Z 2025-05-07T20:31:25.0862632Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:25.0862841Z 2025-05-07T20:31:25.0862946Z moe/activation_test.py:126: 2025-05-07T20:31:25.0863254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:25.0863580Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:25.0863908Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:25.0864714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:25.0865480Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:25.0866014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:25.0866696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:25.0867379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:25.0868105Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:25.0868844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:25.0869600Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:25.0870405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:25.0871048Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:25.0871639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:25.0872159Z fn() 2025-05-07T20:31:25.0872654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:25.0873320Z self.fn.run( 2025-05-07T20:31:25.0873791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:25.0874322Z kernel = self.compile( 2025-05-07T20:31:25.0874858Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba5c74820>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
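[Editor's note] Every failure in this excerpt reduces to the same root cause: Triton refuses to lower fp8e4nv (its E4M3 float8 type, the one torch.float8_e4m3fn maps to) on this GPU, and the error message itself reports that only fp8e4b15 and fp8e5 are available here. To our knowledge the E4M3 variant needs compute capability 8.9 or newer. A minimal sketch of a guard that would turn these hard failures into skips follows; the helper name and the 8.9 threshold are our assumptions, not something the log states:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv (E4M3) requires sm_89 (Ada) or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant:
    requires_fp8_e4m3 = unittest.skipIf(
        not supports_fp8_e4m3(),
        "Triton fp8e4nv (E4M3) is not supported on this GPU architecture",
    )

Applied to the test method above, this would report the cases as skipped on unsupported hardware instead of failing every Hypothesis example.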
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.6723129Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.6724556Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.6725799Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.6727021Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.6728228Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.6729262Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.6730285Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.6731506Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.6732795Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.6733906Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.6734945Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.6736119Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.6737498Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.6738594Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.6739508Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.6740250Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.6741266Z W0507 20:31:25.666767 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:25.8776163Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:25.8778244Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:25.8780914Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:25.8783974Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:25.8786714Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:25.8788468Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:25.8789773Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:25.8791266Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:25.8792675Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:25.8793925Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:25.8795151Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:25.8796371Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:25.8797429Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:25.8798486Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:25.8799714Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:25.8800998Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:25.8802111Z W0507 
20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:25.8803162Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:25.8804717Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:25.8806183Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:25.8807420Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:25.8808358Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:25.8809096Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:25.8810222Z W0507 20:31:25.873613 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.4378888Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.4380035Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.4381415Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.4382875Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.4384261Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.4385637Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.4386950Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.4388379Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.4389907Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.4391154Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.4392376Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.4393573Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.4394614Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.4395638Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.4396863Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.4398495Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.4399606Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:26.4400649Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:26.4401956Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:26.4403306Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:26.4404654Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:26.4405560Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:26.4406300Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:26.4407334Z W0507 20:31:26.433857 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:26.4774924Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:26.4776286Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:26.4777620Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:26.4779088Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:26.4780465Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:26.4781853Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:26.4783162Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:26.4784533Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:26.4785956Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:26.4787220Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:26.4788645Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:26.4789939Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:26.4790965Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:26.4792137Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:26.4793355Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:26.4794633Z W0507 20:31:26.473669 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:26.4795748Z W0507 
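[Editor's note] The identify_mutated_tensors warning is a symptom of the same root error, not an independent bug: torch.compile lowers a user Triton kernel to TTIR to analyze which arguments it mutates, and when that lowering throws it falls back to conservatively treating every input as mutated. The inner exception here is the same fp8e4nv ValueError. A standalone sketch of ours (not from the log) that reproduces the underlying error without FBGEMM or torch.compile, by compiling a trivial Triton kernel that casts to tl.float8e4nv:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below forces fp8e4nv (E4M3) into the generated TTIR.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    # On a GPU without E4M3 support this raises CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    _cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)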
2025-05-07T20:31:27.2318268Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba62e1ee0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
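[Editor's note] Both call paths fail the same way: fn() dies compiling _fbgemm_silu_mul_quant, and ref_fn() dies compiling _kernel_quantize_fp8_row inside triton_quantize_fp8_row. For intuition about what the reference path computes, here is a pure-PyTorch sketch of ours: SiLU(x0) * x1 followed by row-wise fp8 quantization. The scale convention is inferred from the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]); the real triton_quantize_fp8_row may differ in details such as epsilon handling:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0 = x0.float()
        x1 = x1.float()
        y = x0 * torch.sigmoid(x0) * x1        # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)          # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max.clamp(min=1e-12) / FP8_MAX   # dequantization scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Round-tripping through this sketch (y_fp8.float() * y_scale[:, None]) recovers y up to E4M3 rounding, which is the property the test asserts against the Triton kernels.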
2025-05-07T20:31:27.2356218Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test source identical to the listing above]

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... remaining frames identical to the first ref_fn failure above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:27.2396744Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:27.6552146Z W0507 20:31:27.651142 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[The traceback is identical to the [1/1] instance shown above; the same warning recurs at 20:31:27.813980, 20:31:28.301030, and 20:31:28.340664 and is elided here.]
2025-05-07T20:31:29.8223452Z self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... remaining frames identical to the fn() failure above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
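[Editor's note] Both failing kernels hard-code the E4M3 cast, while the error message itself lists fp8e5 (E5M2) as compilable on this architecture. A hypothetical mitigation sketch of ours, selecting an fp8 dtype the current GPU can actually compile; whether E5M2's reduced mantissa precision is acceptable for this quantization is a separate, application-level question:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: E4M3 (Triton fp8e4nv) needs sm_89 or newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn  # maps to Triton fp8e4nv
        return torch.float8_e5m2        # maps to Triton fp8e5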
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.8251870Z 2025-05-07T20:31:29.8252290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.8252798Z 2025-05-07T20:31:29.8252905Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.8253311Z self=, 2025-05-07T20:31:29.8253712Z T=1, 2025-05-07T20:31:29.8253912Z D=7168, 2025-05-07T20:31:29.8254108Z scale_ub=None, 2025-05-07T20:31:29.8254322Z contiguous=True, 2025-05-07T20:31:29.8254540Z compiled=True, 2025-05-07T20:31:29.8254749Z ) 2025-05-07T20:31:29.8255070Z self = 2025-05-07T20:31:29.8255558Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:29.8255819Z 2025-05-07T20:31:29.8255894Z @given( 2025-05-07T20:31:29.8256122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:29.8256441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:29.8256741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:29.8257067Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:29.8257396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:29.8257675Z ) 2025-05-07T20:31:29.8266929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:29.8267417Z def test_silu_mul_quant( 2025-05-07T20:31:29.8267671Z self, 2025-05-07T20:31:29.8267869Z T: int, 2025-05-07T20:31:29.8268075Z D: int, 2025-05-07T20:31:29.8268416Z scale_ub: Optional[float], 2025-05-07T20:31:29.8268692Z contiguous: bool, 2025-05-07T20:31:29.8268941Z compiled: bool, 2025-05-07T20:31:29.8269172Z ) -> None: 2025-05-07T20:31:29.8269386Z torch.manual_seed(2025) 2025-05-07T20:31:29.8269641Z 2025-05-07T20:31:29.8270089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:29.8270431Z 2025-05-07T20:31:29.8270634Z x_sign = torch.sign(x) 2025-05-07T20:31:29.8270933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:29.8271250Z x = x_sign * x_clamp 2025-05-07T20:31:29.8271491Z x0 = x[:, :D] 2025-05-07T20:31:29.8271716Z x1 = x[:, D:] 2025-05-07T20:31:29.8271931Z 2025-05-07T20:31:29.8272117Z if contiguous: 2025-05-07T20:31:29.8272358Z x0 = x0.contiguous() 2025-05-07T20:31:29.8272623Z x1 = x1.contiguous() 2025-05-07T20:31:29.8272860Z 2025-05-07T20:31:29.8273065Z if scale_ub is not None: 2025-05-07T20:31:29.8273349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:29.8273683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:29.8273998Z ) 2025-05-07T20:31:29.8274200Z else: 2025-05-07T20:31:29.8274411Z scale_ub_tensor = None 2025-05-07T20:31:29.8274672Z 2025-05-07T20:31:29.8274908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.8275222Z op = silu_mul_quant 2025-05-07T20:31:29.8275486Z if compiled: 2025-05-07T20:31:29.8275755Z op = torch.compile(op) 2025-05-07T20:31:29.8276058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:29.8276332Z 2025-05-07T20:31:29.8276528Z y_fp8, y_scale = fn() 2025-05-07T20:31:29.8276822Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:29.8277109Z 2025-05-07T20:31:29.8277347Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:29.8277691Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:29.8277990Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:29.8278298Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:29.8278659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.8278983Z 2025-05-07T20:31:29.8279183Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:29.8279386Z 2025-05-07T20:31:29.8279488Z moe/activation_test.py:126: 2025-05-07T20:31:29.8279791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.8280122Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:29.8280456Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:29.8281252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:29.8282031Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:29.8282580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:29.8283280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:29.8283976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:29.8284719Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.8285477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:29.8286238Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:29.8286973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:29.8287702Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:29.8288308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:29.8288844Z fn() 2025-05-07T20:31:29.8289345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:29.8290008Z self.fn.run( 2025-05-07T20:31:29.8290475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:29.8291000Z kernel = self.compile( 2025-05-07T20:31:29.8291545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:29.8292208Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:29.8292605Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:29.8292835Z 2025-05-07T20:31:29.8293049Z self = 2025-05-07T20:31:29.8294146Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:29.8295549Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba5c20280>} 2025-05-07T20:31:29.8296904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:29.8297943Z context = 2025-05-07T20:31:29.8298235Z 2025-05-07T20:31:29.8298404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:29.8298937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:29.8299407Z module_map=module_map) 2025-05-07T20:31:29.8299775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:29.8300134Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:29.8300409Z E ^ 2025-05-07T20:31:29.8300883Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:29.8301346Z 2025-05-07T20:31:29.8301761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:29.8302289Z 2025-05-07T20:31:29.8302396Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:29.8302813Z self=, 2025-05-07T20:31:29.8303213Z T=4096, 2025-05-07T20:31:29.8303408Z D=5120, 2025-05-07T20:31:29.8303608Z scale_ub=None, 2025-05-07T20:31:29.8304231Z contiguous=False, 2025-05-07T20:31:29.8304462Z compiled=False, 2025-05-07T20:31:29.8304674Z ) 2025-05-07T20:31:30.4614829Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:30.4616175Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:30.4617541Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:30.4619066Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:30.4620806Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:30.4622203Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:30.4623669Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:30.4625057Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:30.4626491Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:30.4627751Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 
2025-05-07T20:31:30.4629039Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:30.4630391Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:30.4631440Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:30.4632481Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:30.4633705Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:30.4635009Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:30.4636134Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:30.4637184Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:30.4638467Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:30.4639912Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:30.4640981Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:30.4641897Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:30.4642642Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:30.4643754Z W0507 20:31:30.457041 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:31.8850000Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:31.8851233Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:31.8852572Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:31.8854008Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:31.8855388Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:31.8856774Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:31.8858085Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:31.8859479Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:31.8860907Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:31.8862323Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:31.8863548Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.8864872Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:31.8865913Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:31.8866941Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:31.8868164Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.8869461Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.8870702Z W0507
20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:31.8871754Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:31.8872941Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.8874313Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.8875378Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.8876298Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.8877052Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:31.8878072Z W0507 20:31:31.881127 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4648220Z self = 2025-05-07T20:31:35.4649052Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.4649440Z 2025-05-07T20:31:35.4649546Z @given( 2025-05-07T20:31:35.4649857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.4650253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.4650571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.4650921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.4651246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.4651538Z ) 2025-05-07T20:31:35.4651893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.4652334Z def test_silu_mul_quant( 2025-05-07T20:31:35.4652585Z self, 2025-05-07T20:31:35.4652791Z T: int, 2025-05-07T20:31:35.4652989Z D: int, 2025-05-07T20:31:35.4653216Z scale_ub: Optional[float], 2025-05-07T20:31:35.4653492Z contiguous: bool, 2025-05-07T20:31:35.4654099Z compiled: bool, 2025-05-07T20:31:35.4654331Z ) -> None: 2025-05-07T20:31:35.4654554Z torch.manual_seed(2025) 2025-05-07T20:31:35.4654801Z 2025-05-07T20:31:35.4655068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.4655412Z 2025-05-07T20:31:35.4655606Z x_sign = torch.sign(x) 2025-05-07T20:31:35.4656046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.4656366Z x = x_sign * x_clamp 2025-05-07T20:31:35.4656611Z x0 = x[:, :D] 2025-05-07T20:31:35.4656824Z x1 = x[:, D:] 2025-05-07T20:31:35.4657041Z 2025-05-07T20:31:35.4657232Z if contiguous: 2025-05-07T20:31:35.4657460Z x0 = x0.contiguous() 2025-05-07T20:31:35.4657725Z x1 = x1.contiguous() 2025-05-07T20:31:35.4657971Z 2025-05-07T20:31:35.4658175Z if scale_ub is not None: 2025-05-07T20:31:35.4658447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.4658796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.4659111Z ) 2025-05-07T20:31:35.4659310Z else: 2025-05-07T20:31:35.4659519Z scale_ub_tensor = None 
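For orientation while reading these repeated traces: ref_fn above computes the eager reference, a SiLU-gated product y = x0 * sigmoid(x0) * x1 in float32, and then quantizes each row to fp8 via triton_quantize_fp8_row, which is why the reference path hits the same Triton error as the kernel under test. A rough pure-PyTorch equivalent of that rowwise quantization (an illustrative approximation with assumed scale and clamping details, not FBGEMM's actual kernel) looks like this:

    from typing import Optional, Tuple
    import torch

    def rowwise_quantize_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            # scale_ub caps the per-row maximum, as in the scale_ub_tensor branch above.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max  # per-row dequantization scale
        y_fp8 = (y.float() / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        # Consistent with the check above: y_fp8.to(torch.float32) * y_scale[:, None] ~= y.
        return y_fp8, y_scale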
2025-05-07T20:31:35.4659775Z 2025-05-07T20:31:35.4660012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.4660333Z op = silu_mul_quant 2025-05-07T20:31:35.4660589Z if compiled: 2025-05-07T20:31:35.4660845Z op = torch.compile(op) 2025-05-07T20:31:35.4661143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4661427Z 2025-05-07T20:31:35.4661625Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.4661790Z 2025-05-07T20:31:35.4661896Z moe/activation_test.py:117: 2025-05-07T20:31:35.4662198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4662535Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.4662815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4663520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.4664225Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.4664766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.4665451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.4666111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.4666649Z kernel = self.compile( 2025-05-07T20:31:35.4667191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.4667852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4668248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4668477Z 2025-05-07T20:31:35.4668695Z self = 2025-05-07T20:31:35.4669782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.4671278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195f70>} 2025-05-07T20:31:35.4672620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.4673658Z context = 2025-05-07T20:31:35.4673944Z 2025-05-07T20:31:35.4674425Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.4674950Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4675419Z module_map=module_map) 2025-05-07T20:31:35.4675789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4676226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4676482Z E ^ 2025-05-07T20:31:35.4676964Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4677414Z 2025-05-07T20:31:35.4677838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.4678360Z 2025-05-07T20:31:35.4678464Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.4678880Z self=, 2025-05-07T20:31:35.4679282Z T=4096, 2025-05-07T20:31:35.4679479Z D=7168, 2025-05-07T20:31:35.4679669Z scale_ub=None, 2025-05-07T20:31:35.4679889Z contiguous=False, 2025-05-07T20:31:35.4680118Z compiled=False, 2025-05-07T20:31:35.4680324Z ) 2025-05-07T20:31:35.4680644Z self = 2025-05-07T20:31:35.4681149Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.4681421Z 2025-05-07T20:31:35.4681504Z @given( 2025-05-07T20:31:35.4681734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.4682055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.4682358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.4682692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.4683027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.4683314Z ) 2025-05-07T20:31:35.4683660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.4684107Z def test_silu_mul_quant( 2025-05-07T20:31:35.4684354Z self, 2025-05-07T20:31:35.4684545Z T: int, 2025-05-07T20:31:35.4684748Z D: int, 2025-05-07T20:31:35.4684970Z scale_ub: Optional[float], 2025-05-07T20:31:35.4685239Z contiguous: bool, 2025-05-07T20:31:35.4685485Z compiled: bool, 2025-05-07T20:31:35.4685716Z ) -> None: 2025-05-07T20:31:35.4685928Z torch.manual_seed(2025) 2025-05-07T20:31:35.4686170Z 2025-05-07T20:31:35.4686445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.4686781Z 2025-05-07T20:31:35.4686974Z x_sign = torch.sign(x) 2025-05-07T20:31:35.4687266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.4687568Z x = x_sign * x_clamp 2025-05-07T20:31:35.4687807Z x0 = x[:, :D] 2025-05-07T20:31:35.4688041Z x1 = x[:, D:] 2025-05-07T20:31:35.4688254Z 2025-05-07T20:31:35.4688434Z if contiguous: 2025-05-07T20:31:35.4688673Z x0 = x0.contiguous() 2025-05-07T20:31:35.4688934Z x1 = x1.contiguous() 2025-05-07T20:31:35.4689173Z 2025-05-07T20:31:35.4689369Z if scale_ub is not None: 2025-05-07T20:31:35.4689660Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.4690032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.4690347Z ) 2025-05-07T20:31:35.4690548Z else: 2025-05-07T20:31:35.4690759Z scale_ub_tensor = None 2025-05-07T20:31:35.4691006Z 2025-05-07T20:31:35.4691242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.4691562Z op = silu_mul_quant 2025-05-07T20:31:35.4691807Z if compiled: 2025-05-07T20:31:35.4692057Z op = torch.compile(op) 2025-05-07T20:31:35.4692359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4692633Z 2025-05-07T20:31:35.4692830Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.4692993Z 2025-05-07T20:31:35.4693183Z moe/activation_test.py:117: 2025-05-07T20:31:35.4693475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4693810Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.4694095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.4694870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.4695566Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.4696106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.4696788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.4697440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.4697978Z kernel = self.compile( 2025-05-07T20:31:35.4698525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.4699178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.4699572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.4699814Z 2025-05-07T20:31:35.4700033Z self = 2025-05-07T20:31:35.4701167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.4702546Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba0eadee0>} 2025-05-07T20:31:35.4704071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.4705237Z context = 2025-05-07T20:31:35.4705534Z 2025-05-07T20:31:35.4705701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.4706231Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.4706693Z module_map=module_map) 2025-05-07T20:31:35.4707064Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.4707422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.4707682Z E ^ 2025-05-07T20:31:35.4708148Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.4708605Z 2025-05-07T20:31:35.4709035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.4709554Z 2025-05-07T20:31:35.4709662Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.4710111Z self=, 2025-05-07T20:31:35.4710512Z T=128, 2025-05-07T20:31:35.4710706Z D=7168, 2025-05-07T20:31:35.4710899Z scale_ub=None, 2025-05-07T20:31:35.4711111Z contiguous=False, 2025-05-07T20:31:35.4711337Z compiled=True, 2025-05-07T20:31:35.4711566Z ) 2025-05-07T20:31:35.5503465Z self = 2025-05-07T20:31:35.5505092Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:35.5505641Z 2025-05-07T20:31:35.5505799Z @given( 2025-05-07T20:31:35.5506264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.5506885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.5507803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.5508468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.5509121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.5509698Z ) 2025-05-07T20:31:35.5510124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.5518934Z def test_silu_mul_quant( 2025-05-07T20:31:35.5519197Z self, 2025-05-07T20:31:35.5519402Z T: int, 2025-05-07T20:31:35.5519606Z D: int, 2025-05-07T20:31:35.5519826Z scale_ub: Optional[float], 2025-05-07T20:31:35.5520112Z contiguous: bool, 2025-05-07T20:31:35.5520361Z compiled: bool, 2025-05-07T20:31:35.5520596Z ) -> None: 2025-05-07T20:31:35.5520812Z torch.manual_seed(2025) 2025-05-07T20:31:35.5521065Z 2025-05-07T20:31:35.5521353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.5521698Z 2025-05-07T20:31:35.5521901Z x_sign = torch.sign(x) 2025-05-07T20:31:35.5522211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.5522524Z x = x_sign * x_clamp 2025-05-07T20:31:35.5522778Z x0 = x[:, :D] 2025-05-07T20:31:35.5523005Z x1 = x[:, D:] 2025-05-07T20:31:35.5523210Z 2025-05-07T20:31:35.5523405Z if contiguous: 2025-05-07T20:31:35.5523655Z x0 = x0.contiguous() 2025-05-07T20:31:35.5523918Z x1 = x1.contiguous() 2025-05-07T20:31:35.5524171Z 2025-05-07T20:31:35.5524367Z if scale_ub is not None: 2025-05-07T20:31:35.5524639Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.5524987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.5525301Z ) 2025-05-07T20:31:35.5525499Z else: 2025-05-07T20:31:35.5525707Z scale_ub_tensor = None 2025-05-07T20:31:35.5525964Z 2025-05-07T20:31:35.5526209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.5526529Z op = silu_mul_quant 2025-05-07T20:31:35.5526785Z if compiled: 2025-05-07T20:31:35.5527044Z op = torch.compile(op) 2025-05-07T20:31:35.5527342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.5527626Z 2025-05-07T20:31:35.5527829Z y_fp8, y_scale = fn() 2025-05-07T20:31:35.5528120Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:35.5528420Z 2025-05-07T20:31:35.5528663Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.5528997Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:35.5529299Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:35.5529629Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:35.5530037Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.5530346Z 2025-05-07T20:31:35.5530553Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:35.5530756Z 2025-05-07T20:31:35.5530869Z moe/activation_test.py:126: 2025-05-07T20:31:35.5531167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.5531508Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:35.5531843Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:35.5532655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:35.5533429Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:35.5533996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.5534696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.5535388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:35.5536240Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.5537008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:35.5537773Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:35.5538508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:35.5539243Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:35.5539856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:35.5540425Z fn() 2025-05-07T20:31:35.5540932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:35.5541509Z self.fn.run( 2025-05-07T20:31:35.5541970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.5542505Z kernel = self.compile( 2025-05-07T20:31:35.5543044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.5543694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.5544098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.5544340Z 2025-05-07T20:31:35.5544548Z self = 2025-05-07T20:31:35.5545665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.5547072Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba1a29700>} 2025-05-07T20:31:35.5548448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.5549493Z context = 2025-05-07T20:31:35.5549789Z 2025-05-07T20:31:35.5550028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.5550557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.5551025Z module_map=module_map) 2025-05-07T20:31:35.5551396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.5551755Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:35.5552019Z E ^ 2025-05-07T20:31:35.5552492Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.5552958Z 2025-05-07T20:31:35.5553377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.5553895Z 2025-05-07T20:31:35.5554004Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.5554418Z self=, 2025-05-07T20:31:35.5554829Z T=128, 2025-05-07T20:31:35.5555017Z D=7168, 2025-05-07T20:31:35.5555207Z scale_ub=None, 2025-05-07T20:31:35.5555427Z contiguous=False, 2025-05-07T20:31:35.5555656Z compiled=False, 2025-05-07T20:31:35.5555859Z ) 2025-05-07T20:31:35.8037514Z self = 2025-05-07T20:31:35.8038140Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:35.8038475Z 2025-05-07T20:31:35.8038556Z @given( 2025-05-07T20:31:35.8038790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.8039384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.8039696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.8040029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.8040363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.8040645Z ) 2025-05-07T20:31:35.8041125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.8041567Z def test_silu_mul_quant( 2025-05-07T20:31:35.8041810Z self, 2025-05-07T20:31:35.8042012Z T: int, 2025-05-07T20:31:35.8042213Z D: int, 2025-05-07T20:31:35.8042431Z scale_ub: Optional[float], 2025-05-07T20:31:35.8042705Z contiguous: bool, 2025-05-07T20:31:35.8042949Z compiled: bool, 2025-05-07T20:31:35.8043170Z ) -> None: 2025-05-07T20:31:35.8043392Z torch.manual_seed(2025) 2025-05-07T20:31:35.8043638Z 2025-05-07T20:31:35.8043917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.8044261Z 2025-05-07T20:31:35.8044462Z x_sign = torch.sign(x) 2025-05-07T20:31:35.8044755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.8045061Z x = x_sign * x_clamp 2025-05-07T20:31:35.8045304Z x0 = x[:, :D] 2025-05-07T20:31:35.8045529Z x1 = x[:, D:] 2025-05-07T20:31:35.8045732Z 2025-05-07T20:31:35.8045926Z if contiguous: 2025-05-07T20:31:35.8046158Z x0 = x0.contiguous() 2025-05-07T20:31:35.8046417Z x1 = x1.contiguous() 2025-05-07T20:31:35.8046659Z 2025-05-07T20:31:35.8046851Z if scale_ub is not None: 2025-05-07T20:31:35.8047121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.8047463Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.8047776Z ) 2025-05-07T20:31:35.8047974Z else: 2025-05-07T20:31:35.8048185Z scale_ub_tensor = None 2025-05-07T20:31:35.8048436Z 2025-05-07T20:31:35.8048670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.8048986Z op = silu_mul_quant 2025-05-07T20:31:35.8049239Z if compiled: 
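The blocks of warnings from triton_kernel_wrap.py in this log are torch.compile degrading gracefully rather than a separate bug: before wrapping a user-defined Triton kernel, Dynamo lowers it to TTIR so it can determine exactly which tensor arguments the kernel writes to, and when that lowering raises (here, the same fp8e4nv CompilationError) it assumes every input is mutated, which is correct but conservative. The control flow is roughly the following sketch, where _generate_ttir and _mutations_from_ttir are illustrative stand-ins for torch's internal helpers:

    import torch

    def _generate_ttir(kernel, kwargs):
        # Stand-in: on this machine, Triton's frontend raises while building TTIR.
        raise RuntimeError("type fp8e4nv not supported in this architecture")

    def _mutations_from_ttir(ttir_module):
        # Stand-in: the real analysis walks the TTIR to find stored-to pointers.
        return []

    def identify_mutated_tensors_sketch(kernel, kwargs):
        tensor_args = [k for k, v in kwargs.items() if isinstance(v, torch.Tensor)]
        try:
            ttir_module = _generate_ttir(kernel, kwargs)
            return _mutations_from_ttir(ttir_module)  # precise answer when TTIR builds
        except Exception:
            # Logged above as "Encountered an exception in identify_mutated_tensors,
            # assuming every input is mutated": safe, but it blocks some optimizations.
            return tensor_args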
2025-05-07T20:31:35.8049488Z op = torch.compile(op) 2025-05-07T20:31:35.8049781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8050064Z 2025-05-07T20:31:35.8050257Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.8050421Z 2025-05-07T20:31:35.8050525Z moe/activation_test.py:117: 2025-05-07T20:31:35.8050822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8051154Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.8051432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8052129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.8052820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.8053361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.8054035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.8054693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.8055226Z kernel = self.compile( 2025-05-07T20:31:35.8055760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.8056412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.8056812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8057043Z 2025-05-07T20:31:35.8057258Z self = 2025-05-07T20:31:35.8058422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.8059806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab681d7d30>} 2025-05-07T20:31:35.8061232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.8062256Z context = 2025-05-07T20:31:35.8062546Z 2025-05-07T20:31:35.8062721Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.8063237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.8063713Z module_map=module_map) 2025-05-07T20:31:35.8064085Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.8064438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.8064706Z E ^ 2025-05-07T20:31:35.8065185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:35.8065641Z 2025-05-07T20:31:35.8066061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:35.8066571Z 2025-05-07T20:31:35.8066677Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:35.8067104Z self=, 2025-05-07T20:31:35.8067513Z T=4096, 2025-05-07T20:31:35.8067699Z D=5120, 2025-05-07T20:31:35.8067893Z scale_ub=1200.0, 2025-05-07T20:31:35.8068125Z contiguous=True, 2025-05-07T20:31:35.8068354Z compiled=False, 2025-05-07T20:31:35.8068554Z ) 2025-05-07T20:31:35.8068880Z self = 2025-05-07T20:31:35.8069376Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:35.8069650Z 2025-05-07T20:31:35.8069728Z @given( 2025-05-07T20:31:35.8070033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:35.8070349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:35.8070649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:35.8070976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:35.8071304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:35.8071583Z ) 2025-05-07T20:31:35.8071932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:35.8072374Z def test_silu_mul_quant( 2025-05-07T20:31:35.8072625Z self, 2025-05-07T20:31:35.8072823Z T: int, 2025-05-07T20:31:35.8073023Z D: int, 2025-05-07T20:31:35.8073246Z scale_ub: Optional[float], 2025-05-07T20:31:35.8073515Z contiguous: bool, 2025-05-07T20:31:35.8073754Z compiled: bool, 2025-05-07T20:31:35.8073978Z ) -> None: 2025-05-07T20:31:35.8074189Z torch.manual_seed(2025) 2025-05-07T20:31:35.8074437Z 2025-05-07T20:31:35.8074711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:35.8075052Z 2025-05-07T20:31:35.8075251Z x_sign = torch.sign(x) 2025-05-07T20:31:35.8075543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:35.8075847Z x = x_sign * x_clamp 2025-05-07T20:31:35.8076090Z x0 = x[:, :D] 2025-05-07T20:31:35.8076313Z x1 = x[:, D:] 2025-05-07T20:31:35.8076520Z 2025-05-07T20:31:35.8076714Z if contiguous: 2025-05-07T20:31:35.8076951Z x0 = x0.contiguous() 2025-05-07T20:31:35.8077216Z x1 = x1.contiguous() 2025-05-07T20:31:35.8077453Z 2025-05-07T20:31:35.8077651Z if scale_ub is not None: 2025-05-07T20:31:35.8078015Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:35.8078346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:35.8078663Z ) 2025-05-07T20:31:35.8078859Z else: 2025-05-07T20:31:35.8079073Z scale_ub_tensor = None 2025-05-07T20:31:35.8079328Z 2025-05-07T20:31:35.8079671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:35.8079982Z op = silu_mul_quant 2025-05-07T20:31:35.8080235Z if compiled: 2025-05-07T20:31:35.8080483Z op = torch.compile(op) 2025-05-07T20:31:35.8080776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8081052Z 2025-05-07T20:31:35.8081246Z > y_fp8, y_scale = fn() 2025-05-07T20:31:35.8081410Z 2025-05-07T20:31:35.8081513Z moe/activation_test.py:117: 2025-05-07T20:31:35.8081818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8082148Z moe/activation_test.py:115: in fn 2025-05-07T20:31:35.8082435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:35.8083114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:35.8083798Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:35.8084339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:35.8085017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:35.8085669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:35.8086198Z kernel = self.compile( 2025-05-07T20:31:35.8086734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:35.8087383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:35.8087774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:35.8088004Z 2025-05-07T20:31:35.8088208Z self = 2025-05-07T20:31:35.8089286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:35.8090666Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195940>} 2025-05-07T20:31:35.8092002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:35.8093028Z context = 2025-05-07T20:31:35.8093323Z 2025-05-07T20:31:35.8093488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:35.8094012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:35.8094474Z module_map=module_map) 2025-05-07T20:31:35.8094844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:35.8095196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:35.8095448Z E ^ 2025-05-07T20:31:35.8095916Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:35.8097293Z 
2025-05-07T20:31:35.8097402Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:35.8097900Z     self=,
2025-05-07T20:31:35.8098315Z     T=1,
2025-05-07T20:31:35.8098497Z     D=5120,
2025-05-07T20:31:35.8098685Z     scale_ub=None,
2025-05-07T20:31:35.8098901Z     contiguous=True,
2025-05-07T20:31:35.8099127Z     compiled=True,
2025-05-07T20:31:35.8099330Z )
2025-05-07T20:31:36.4083161Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:36.4085697Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:31:36.4088392Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:36.4090624Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:36.4092012Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:36.4093403Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4094715Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:36.4096091Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4097512Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:36.4098763Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:31:36.4099980Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:36.4101194Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:31:36.4102236Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit
2025-05-07T20:31:36.4103256Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
2025-05-07T20:31:36.4104663Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:36.4105957Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:36.4107071Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit
2025-05-07T20:31:36.4108240Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
2025-05-07T20:31:36.4109418Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:36.4111230Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:36.4112293Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4113205Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4113942Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:31:36.4114964Z W0507 20:31:36.404152 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.4884455Z self =
2025-05-07T20:31:37.4885903Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.4886621Z 
2025-05-07T20:31:37.4886829Z     @given(
2025-05-07T20:31:37.4887457Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:37.4888402Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:37.4895667Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:37.4896050Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:37.4896384Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:37.4896672Z     )
2025-05-07T20:31:37.4897038Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:37.4897487Z     def test_silu_mul_quant(
2025-05-07T20:31:37.4897729Z         self,
2025-05-07T20:31:37.4897930Z         T: int,
2025-05-07T20:31:37.4898130Z         D: int,
2025-05-07T20:31:37.4898354Z         scale_ub: Optional[float],
2025-05-07T20:31:37.4898634Z         contiguous: bool,
2025-05-07T20:31:37.4898876Z         compiled: bool,
2025-05-07T20:31:37.4899105Z     ) -> None:
2025-05-07T20:31:37.4899326Z         torch.manual_seed(2025)
2025-05-07T20:31:37.4899576Z 
2025-05-07T20:31:37.4899854Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:37.4900206Z 
2025-05-07T20:31:37.4900406Z         x_sign = torch.sign(x)
2025-05-07T20:31:37.4900703Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:37.4901009Z         x = x_sign * x_clamp
2025-05-07T20:31:37.4901253Z         x0 = x[:, :D]
2025-05-07T20:31:37.4901470Z         x1 = x[:, D:]
2025-05-07T20:31:37.4901675Z 
2025-05-07T20:31:37.4901866Z         if contiguous:
2025-05-07T20:31:37.4902103Z             x0 = x0.contiguous()
2025-05-07T20:31:37.4902360Z             x1 = x1.contiguous()
2025-05-07T20:31:37.4902605Z 
2025-05-07T20:31:37.4902803Z         if scale_ub is not None:
2025-05-07T20:31:37.4903076Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:37.4903418Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:37.4903980Z             )
2025-05-07T20:31:37.4904174Z         else:
2025-05-07T20:31:37.4904390Z             scale_ub_tensor = None
2025-05-07T20:31:37.4904648Z 
2025-05-07T20:31:37.4904872Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:37.4905186Z             op = silu_mul_quant
2025-05-07T20:31:37.4905439Z             if compiled:
2025-05-07T20:31:37.4905684Z                 op = torch.compile(op)
2025-05-07T20:31:37.4905980Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:37.4906251Z 
2025-05-07T20:31:37.4906444Z         y_fp8, y_scale = fn()
2025-05-07T20:31:37.4906724Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:37.4907015Z 
2025-05-07T20:31:37.4907248Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:37.4907579Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:37.4907868Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:37.4908184Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:37.4908535Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:37.4908851Z 
2025-05-07T20:31:37.4909051Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:37.4909247Z 
2025-05-07T20:31:37.4909353Z moe/activation_test.py:126: 
2025-05-07T20:31:37.4909646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:37.4910047Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:37.4910374Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:37.4911161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:37.4912077Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:37.4912625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:37.4913308Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:37.4913988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:37.4914831Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:37.4915576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:37.4916325Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:37.4917048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:37.4917690Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:37.4918286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:37.4918806Z     fn()
2025-05-07T20:31:37.4919307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:37.4919887Z     self.fn.run(
2025-05-07T20:31:37.4920353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:37.4920880Z     kernel = self.compile(
2025-05-07T20:31:37.4921417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:37.4922064Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:37.4922463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:37.4922691Z 
2025-05-07T20:31:37.4922916Z self =
2025-05-07T20:31:37.4924007Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:37.4925411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab68cea820>}
2025-05-07T20:31:37.4926753Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:37.4927776Z context =
2025-05-07T20:31:37.4928061Z 
2025-05-07T20:31:37.4928237Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:37.4928765Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:37.4929226Z                           module_map=module_map)
2025-05-07T20:31:37.4929593Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.4929946Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:37.4930223Z E       ^
2025-05-07T20:31:37.4930693Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.4931147Z 
2025-05-07T20:31:37.4931559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
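(Editor's note) The eager reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel that materializes an fp8e4nv (e4m3) output. For readers following the numerics, here is a rough eager-mode sketch of the row-wise FP8 quantization contract the test relies on (y_fp8.float() * scale[:, None] recovers y); the function name, the eps, and the exact scale_ub semantics are this editor's assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max |value| lands at the
        # fp8 max (448.0 for float8_e4m3fn).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outliers before scaling
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

On an SM 8.9+ GPU the Triton kernel does this in a single pass; the sketch exists only to make the expected numerics concrete.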
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.0002705Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.0004091Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.0005291Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.0006321Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.0007345Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.0008867Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.0010465Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.0011574Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.0012614Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.0013916Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.0015274Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.0016434Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.0017339Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.0018077Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.0019100Z W0507 20:31:37.994772 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.1869104Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.1870463Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.1871801Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.1873209Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.1874584Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.1875962Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.1877260Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.1878624Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.1880054Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.1881298Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.1882519Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.1883728Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.1884764Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.1885949Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.1887172Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.1888558Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.1889670Z W0507 
20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.1890711Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.1891889Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.1893241Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.1894307Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.1895209Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.1895951Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.1896965Z W0507 20:31:38.182895 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.6913523Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.6914741Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.6916089Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.6917511Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.6918895Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.6920272Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.6921576Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.6922940Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.6924509Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.6925754Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.6926969Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.6928279Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.6929312Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.6930326Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.6931547Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.6932816Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.6933932Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:38.6934966Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:38.6936153Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:38.6937501Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:38.6938553Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.6939464Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:38.6940204Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:38.6941276Z W0507 20:31:38.687286 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.7312594Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:38.7313895Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:38.7315229Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:38.7316642Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:38.7318195Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:38.7319574Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.7320866Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:38.7322343Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.7323757Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:38.7325003Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:38.7326221Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:38.7327425Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:38.7328450Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:38.7329466Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:31:38.7330725Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:38.7332009Z W0507 20:31:38.727338 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:38.7333114Z W0507 
2025-05-07T20:31:39.2168240Z self =
2025-05-07T20:31:39.2169160Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:39.2169506Z 
2025-05-07T20:31:39.2188898Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:39.2189095Z 
2025-05-07T20:31:39.2189207Z moe/activation_test.py:126: 
2025-05-07T20:31:39.2189504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:39.2189905Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:39.2190237Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:39.2191030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:39.2191796Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:39.2210028Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.2210389Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:39.2210657Z E       ^
2025-05-07T20:31:39.2211118Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:39.2211583Z 
2025-05-07T20:31:39.2211998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:39.7441451Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:39.7442659Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:39.7443871Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:39.7444905Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:39.7445918Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:39.7447129Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:39.7448406Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:39.7449522Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:39.7450563Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:39.7451735Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:39.7453163Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:39.7454227Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:39.7455211Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:39.7455949Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:39.7456957Z W0507 20:31:39.738538 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:40.9399982Z self = 
2025-05-07T20:31:40.9400678Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:40.9401056Z 
2025-05-07T20:31:40.9401182Z     @given(
2025-05-07T20:31:40.9401723Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:40.9402170Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:40.9402503Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:40.9402828Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:40.9403184Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:40.9403601Z     )
2025-05-07T20:31:40.9404129Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:40.9404570Z     def test_silu_mul_quant(
2025-05-07T20:31:40.9404809Z         self,
2025-05-07T20:31:40.9405010Z         T: int,
2025-05-07T20:31:40.9405208Z         D: int,
2025-05-07T20:31:40.9405428Z         scale_ub: Optional[float],
2025-05-07T20:31:40.9405700Z         contiguous: bool,
2025-05-07T20:31:40.9405935Z         compiled: bool,
2025-05-07T20:31:40.9406163Z     ) -> None:
2025-05-07T20:31:40.9406382Z         torch.manual_seed(2025)
2025-05-07T20:31:40.9406619Z 
2025-05-07T20:31:40.9406899Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:40.9407239Z 
2025-05-07T20:31:40.9407425Z         x_sign = torch.sign(x)
2025-05-07T20:31:40.9407714Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:40.9408024Z         x = x_sign * x_clamp
2025-05-07T20:31:40.9408273Z         x0 = x[:, :D]
2025-05-07T20:31:40.9408479Z         x1 = x[:, D:]
2025-05-07T20:31:40.9408683Z 
2025-05-07T20:31:40.9408870Z         if contiguous:
2025-05-07T20:31:40.9409095Z             x0 = x0.contiguous()
2025-05-07T20:31:40.9409357Z             x1 = x1.contiguous()
2025-05-07T20:31:40.9409598Z 
2025-05-07T20:31:40.9409785Z         if scale_ub is not None:
2025-05-07T20:31:40.9410058Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:40.9410394Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:40.9410698Z             )
2025-05-07T20:31:40.9410895Z         else:
2025-05-07T20:31:40.9411111Z             scale_ub_tensor = None
2025-05-07T20:31:40.9411355Z 
2025-05-07T20:31:40.9411589Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.9411903Z             op = silu_mul_quant
2025-05-07T20:31:40.9412147Z             if compiled:
2025-05-07T20:31:40.9412393Z                 op = torch.compile(op)
2025-05-07T20:31:40.9412696Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:40.9412964Z 
2025-05-07T20:31:40.9413153Z         y_fp8, y_scale = fn()
2025-05-07T20:31:40.9413437Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:40.9413730Z 
2025-05-07T20:31:40.9413956Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:40.9414289Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:40.9414582Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:40.9414892Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:40.9415248Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.9415560Z 
2025-05-07T20:31:40.9415753Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:40.9415953Z 
2025-05-07T20:31:40.9416054Z moe/activation_test.py:126: 
2025-05-07T20:31:40.9416350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.9416687Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:40.9417007Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:40.9417796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:40.9418559Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:40.9419099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:40.9419784Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:40.9420591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:40.9421369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.9422110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:40.9422956Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:40.9423678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:40.9424315Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:40.9424910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:40.9425427Z     fn()
2025-05-07T20:31:40.9425931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:40.9426502Z     self.fn.run(
2025-05-07T20:31:40.9426965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:40.9427490Z     kernel = self.compile(
2025-05-07T20:31:40.9428029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:40.9428682Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:40.9429077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:40.9429307Z 
2025-05-07T20:31:40.9429523Z self = 
2025-05-07T20:31:40.9430680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:40.9432114Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa04853c10>}
2025-05-07T20:31:40.9433463Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:40.9434495Z context = 
2025-05-07T20:31:40.9434782Z 
2025-05-07T20:31:40.9434956Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:40.9435469Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:40.9435931Z                            module_map=module_map)
2025-05-07T20:31:40.9436296Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:40.9436653Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:40.9436920Z E       ^
2025-05-07T20:31:40.9437392Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:40.9437841Z 
2025-05-07T20:31:40.9438262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
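Both failing stacks bottom out in triton_quantize_fp8_row, the row-wise fp8 quantizer that ref_fn calls. The test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], so the op returns one dequantization scale per row. A rough eager-mode sketch of that contract follows, for orientation only: the exact scale formula, the clamping, and the FP8_MAX constant (448.0, the largest finite float8_e4m3fn value) are assumptions here, not FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally capped by scale_ub as in the test.
        row_max = x.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX  # dequant scale per row
        x_q = (x.float() / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return x_q.to(torch.float8_e4m3fn), scale

The eager cast at the end goes through PyTorch's own float8 conversion path, which is why only the Triton kernels fail on this machine.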
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:41.4766346Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:41.4767557Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:41.4768766Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:41.4769804Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:41.4770824Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:41.4772094Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:41.4773376Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:41.4774490Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:41.4775532Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:41.4776706Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:41.4778132Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:41.4779199Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.4780119Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.4780940Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:41.4781959Z W0507 20:31:41.471258 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.6648457Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:41.6650945Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:41.6652296Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:41.6653719Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:41.6655093Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:41.6656466Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.6657766Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:41.6659146Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.6660568Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:41.6661861Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:41.6663082Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:41.6664284Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:41.6665330Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:41.6666351Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:41.6667579Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:41.6669075Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:41.6670264Z W0507 
20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:41.6671435Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:41.6672655Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:41.6674012Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:41.6675075Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.6675987Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.6676745Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:41.6677762Z W0507 20:31:41.660749 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.1712163Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:42.1713557Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:42.1714901Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:42.1716327Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:42.1717697Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:42.1719065Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.1720369Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:42.1721736Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.1723149Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:42.1724386Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:42.1725759Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:42.1726959Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:42.1727985Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:42.1729104Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:42.1730310Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:42.1731580Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:42.1732680Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:42.1733715Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:42.1734882Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:42.1736221Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:42.1737272Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.1738171Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.1738906Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:42.1739912Z W0507 20:31:42.167170 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.2099936Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:42.2101215Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:42.2102595Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:42.2104184Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:42.2105560Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:42.2106926Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.2108403Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:42.2109765Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.2111354Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:42.2112584Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:42.2113795Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:42.2114993Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:42.2116015Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:42.2117033Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:31:42.2118239Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:42.2119510Z W0507 20:31:42.206164 86817 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:42.2120612Z W0507 
2025-05-07T20:31:42.6598461Z self = 
2025-05-07T20:31:42.6599905Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:42.6615522Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:42.6615717Z 
2025-05-07T20:31:42.6615823Z moe/activation_test.py:126: 
2025-05-07T20:31:42.6635917Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:42.6636274Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:42.6636538Z E       ^
2025-05-07T20:31:42.6637007Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.6637462Z 
2025-05-07T20:31:42.6637875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:42.6638384Z 
2025-05-07T20:31:42.6638493Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.6638907Z     self=,
2025-05-07T20:31:42.6639306Z     T=16384,
2025-05-07T20:31:42.6639500Z     D=5120,
2025-05-07T20:31:42.6639690Z     scale_ub=None,
2025-05-07T20:31:42.6639899Z     contiguous=True,
2025-05-07T20:31:42.6640122Z     compiled=True,
2025-05-07T20:31:42.6640325Z )
2025-05-07T20:31:42.7083732Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:42.7086532Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:42.7089191Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8]    last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:42.7091176Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:42.7092473Z W0507 20:31:42.706620 86817 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
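This recompile warning is a separate issue from the fp8 errors: every Hypothesis draw changes T, and the contiguous=False draws also change the strides of x0/x1 (the "stride mismatch at index 0" above), so torch.compile re-traces silu_mul_quant until it hits the default recompile_limit of 8 and stops optimizing. A sketch of one mitigation, using the real torch._dynamo.mark_dynamic API with the tensor names from the test; marking sizes dynamic stops size-driven retracing, though stride changes from non-contiguous slices may still re-trace:

    import torch

    op = torch.compile(silu_mul_quant)  # silu_mul_quant as imported by the test

    # Mark dim 0 (the token count T) as dynamic before the first call so a
    # single graph covers all of Hypothesis's T draws.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)

    # Coarser alternative: torch.compile(silu_mul_quant, dynamic=True)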
2025-05-07T20:31:42.8286155Z self = 
2025-05-07T20:31:42.8286897Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:42.8309641Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:42.8310014Z moe/activation_test.py:126: 
2025-05-07T20:31:42.8330419Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:42.8330781Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:42.8331050Z E       ^
2025-05-07T20:31:42.8331616Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:42.8332074Z 
2025-05-07T20:31:42.8332501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:42.8333092Z 
2025-05-07T20:31:42.8333208Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:42.8333626Z     self=,
2025-05-07T20:31:42.8334031Z     T=1,
2025-05-07T20:31:42.8334210Z     D=5120,
2025-05-07T20:31:42.8334405Z     scale_ub=1200.0,
2025-05-07T20:31:42.8334626Z     contiguous=True,
2025-05-07T20:31:42.8334852Z     compiled=True,
2025-05-07T20:31:42.8335060Z )
2025-05-07T20:31:43.2025096Z self = 
2025-05-07T20:31:43.2026308Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:43.2040409Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:43.2040569Z 
2025-05-07T20:31:43.2040679Z moe/activation_test.py:117: 
2025-05-07T20:31:43.2040967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:43.2041293Z moe/activation_test.py:115: in fn
2025-05-07T20:31:43.2041568Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:43.2042124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:43.2042851Z     return fn(*args, **kwargs)
2025-05-07T20:31:43.2043515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.2044195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.2044718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.2045560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.2046216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.2046747Z kernel = self.compile( 2025-05-07T20:31:43.2047278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.2047925Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.2048323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.2048549Z 2025-05-07T20:31:43.2048754Z self = 2025-05-07T20:31:43.2049832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.2051267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa049af310>} 2025-05-07T20:31:43.2052659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.2053680Z context = 2025-05-07T20:31:43.2053971Z 2025-05-07T20:31:43.2054139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.2054653Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.2055116Z module_map=module_map) 2025-05-07T20:31:43.2055481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.2055825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.2056078Z E ^ 2025-05-07T20:31:43.2056547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
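Every Hypothesis example in this test fails at the same point: src.make_ir rejects the fp8e4nv dtype while compiling either _fbgemm_silu_mul_quant (reached from silu_mul_quant at gen_ai/moe/activation.py:80) or _kernel_quantize_fp8_row (reached from triton_quantize_fp8_row in the reference path). fp8e4nv is Triton's name for torch.float8_e4m3fn, and the ValueError says this GPU only gets fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these cases on such hardware; the SM 8.9 cutoff, the helper name, and the class name are assumptions (the log elides the test class repr), not anything taken from FBGEMM:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Assumption: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on NVIDIA
    # GPUs with compute capability >= 8.9 (Ada/Hopper). Pre-Ada parts such as
    # the A100 (8.0) or A10G (8.6) would hit the ValueError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical placement; the real test class name is elided in this log.
@unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...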
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
> y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E triton.compiler.errors.CompilationError (_kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
> y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
E triton.compiler.errors.CompilationError (_kernel_quantize_fp8_row): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
E triton.compiler.errors.CompilationError (_fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5763796Z 2025-05-07T20:31:44.5764209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5764721Z 2025-05-07T20:31:44.5764826Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5765234Z self=, 2025-05-07T20:31:44.5765638Z T=1, 2025-05-07T20:31:44.5765820Z D=5120, 2025-05-07T20:31:44.5766002Z scale_ub=1200.0, 2025-05-07T20:31:44.5766226Z contiguous=False, 2025-05-07T20:31:44.5766454Z compiled=False, 2025-05-07T20:31:44.5766654Z ) 2025-05-07T20:31:44.5766965Z self = 2025-05-07T20:31:44.5767451Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.5767713Z 2025-05-07T20:31:44.5767795Z @given( 2025-05-07T20:31:44.5768014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5768328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5768635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5768953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5769282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5769563Z ) 2025-05-07T20:31:44.5769902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5770335Z def test_silu_mul_quant( 2025-05-07T20:31:44.5770570Z self, 2025-05-07T20:31:44.5770753Z T: int, 2025-05-07T20:31:44.5770947Z D: int, 2025-05-07T20:31:44.5771171Z scale_ub: Optional[float], 2025-05-07T20:31:44.5771439Z contiguous: bool, 2025-05-07T20:31:44.5771748Z compiled: bool, 2025-05-07T20:31:44.5771969Z ) -> None: 2025-05-07T20:31:44.5772185Z torch.manual_seed(2025) 2025-05-07T20:31:44.5772457Z 2025-05-07T20:31:44.5772728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5773065Z 2025-05-07T20:31:44.5773321Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5773607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5773912Z x = x_sign * x_clamp 2025-05-07T20:31:44.5774143Z x0 = x[:, :D] 2025-05-07T20:31:44.5774359Z x1 = x[:, D:] 2025-05-07T20:31:44.5774563Z 2025-05-07T20:31:44.5774736Z if contiguous: 2025-05-07T20:31:44.5774961Z x0 = x0.contiguous() 2025-05-07T20:31:44.5775215Z x1 = x1.contiguous() 2025-05-07T20:31:44.5775445Z 2025-05-07T20:31:44.5775633Z if scale_ub is not None: 2025-05-07T20:31:44.5775903Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5776232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5776539Z ) 2025-05-07T20:31:44.5776725Z else: 2025-05-07T20:31:44.5776930Z scale_ub_tensor = None 2025-05-07T20:31:44.5777165Z 2025-05-07T20:31:44.5777390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5777708Z op = silu_mul_quant 2025-05-07T20:31:44.5777945Z if compiled: 2025-05-07T20:31:44.5778186Z op = torch.compile(op) 2025-05-07T20:31:44.5778479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5778746Z 2025-05-07T20:31:44.5778931Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5779094Z 2025-05-07T20:31:44.5779198Z moe/activation_test.py:117: 2025-05-07T20:31:44.5779481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5779804Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5780086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5780776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5781450Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5781980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5782662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5783309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5783838Z kernel = self.compile( 2025-05-07T20:31:44.5784368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5785011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5785395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5785625Z 2025-05-07T20:31:44.5785829Z self = 2025-05-07T20:31:44.5786901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5788278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f22820>} 2025-05-07T20:31:44.5789619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5790703Z context = 2025-05-07T20:31:44.5790994Z 2025-05-07T20:31:44.5791237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5791764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5792268Z module_map=module_map) 2025-05-07T20:31:44.5792634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5793085Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5793341Z E ^ 2025-05-07T20:31:44.5793797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5794249Z 2025-05-07T20:31:44.5794662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5795169Z 2025-05-07T20:31:44.5795279Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5795689Z self=, 2025-05-07T20:31:44.5796089Z T=16384, 2025-05-07T20:31:44.5796284Z D=5120, 2025-05-07T20:31:44.5796472Z scale_ub=1200.0, 2025-05-07T20:31:44.5796688Z contiguous=False, 2025-05-07T20:31:44.5796909Z compiled=True, 2025-05-07T20:31:44.5797111Z ) 2025-05-07T20:31:44.6993490Z self = 2025-05-07T20:31:44.6994281Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.6994667Z 2025-05-07T20:31:44.6994785Z @given( 2025-05-07T20:31:44.6995054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.6995373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.6995686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.6996014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.6996352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.6996640Z ) 2025-05-07T20:31:44.6996993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.6997437Z def test_silu_mul_quant( 2025-05-07T20:31:44.6997685Z self, 2025-05-07T20:31:44.6997875Z T: int, 2025-05-07T20:31:44.6998081Z D: int, 2025-05-07T20:31:44.6998301Z scale_ub: Optional[float], 2025-05-07T20:31:44.6998577Z contiguous: bool, 2025-05-07T20:31:44.6998819Z compiled: bool, 2025-05-07T20:31:44.6999045Z ) -> None: 2025-05-07T20:31:44.6999267Z torch.manual_seed(2025) 2025-05-07T20:31:44.6999511Z 2025-05-07T20:31:44.6999784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7000134Z 2025-05-07T20:31:44.7000327Z x_sign = torch.sign(x) 2025-05-07T20:31:44.7000626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.7000938Z x = x_sign * x_clamp 2025-05-07T20:31:44.7001168Z x0 = x[:, :D] 2025-05-07T20:31:44.7001384Z x1 = x[:, D:] 2025-05-07T20:31:44.7001592Z 2025-05-07T20:31:44.7001783Z if contiguous: 2025-05-07T20:31:44.7002052Z x0 = x0.contiguous() 2025-05-07T20:31:44.7002334Z x1 = x1.contiguous() 2025-05-07T20:31:44.7002567Z 2025-05-07T20:31:44.7002765Z if scale_ub is not None: 2025-05-07T20:31:44.7003039Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.7003376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.7003869Z ) 2025-05-07T20:31:44.7004068Z else: 2025-05-07T20:31:44.7004281Z scale_ub_tensor = None 2025-05-07T20:31:44.7004530Z 2025-05-07T20:31:44.7004762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.7005076Z op = silu_mul_quant 2025-05-07T20:31:44.7005319Z if compiled: 2025-05-07T20:31:44.7005566Z op = torch.compile(op) 2025-05-07T20:31:44.7005862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7006131Z 2025-05-07T20:31:44.7006320Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.7006656Z 2025-05-07T20:31:44.7006770Z moe/activation_test.py:117: 2025-05-07T20:31:44.7007063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7007393Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.7007676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7008359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.7008909Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.7009569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.7010254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.7010784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.7011459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.7012154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.7012718Z kernel = self.compile( 2025-05-07T20:31:44.7013250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.7013912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7014306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7014533Z 2025-05-07T20:31:44.7014745Z self = 2025-05-07T20:31:44.7015823Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.7017206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0295d790>} 2025-05-07T20:31:44.7018547Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.7019590Z context = 2025-05-07T20:31:44.7019877Z 2025-05-07T20:31:44.7020048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.7020571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7021035Z module_map=module_map) 2025-05-07T20:31:44.7021400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7021745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7022002Z E ^ 2025-05-07T20:31:44.7022475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.7022926Z 2025-05-07T20:31:44.7023346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.7023856Z 2025-05-07T20:31:44.7023964Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7024376Z self=, 2025-05-07T20:31:44.7024776Z T=2048, 2025-05-07T20:31:44.7024957Z D=7168, 2025-05-07T20:31:44.7025148Z scale_ub=1200.0, 2025-05-07T20:31:44.7025372Z contiguous=False, 2025-05-07T20:31:44.7025597Z compiled=True, 2025-05-07T20:31:44.7025794Z ) 2025-05-07T20:31:44.7026109Z self = 2025-05-07T20:31:44.7026602Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:44.7026874Z 2025-05-07T20:31:44.7027481Z @given( 2025-05-07T20:31:44.7027713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.7028029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.7028330Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.7028655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.7029062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.7029344Z ) 2025-05-07T20:31:44.7029691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.7030198Z def test_silu_mul_quant( 2025-05-07T20:31:44.7030438Z self, 2025-05-07T20:31:44.7030625Z T: int, 2025-05-07T20:31:44.7030822Z D: int, 2025-05-07T20:31:44.7031038Z scale_ub: Optional[float], 2025-05-07T20:31:44.7031299Z contiguous: bool, 2025-05-07T20:31:44.7031539Z compiled: bool, 2025-05-07T20:31:44.7031764Z ) -> None: 2025-05-07T20:31:44.7031995Z torch.manual_seed(2025) 2025-05-07T20:31:44.7032268Z 2025-05-07T20:31:44.7032546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.7032882Z 2025-05-07T20:31:44.7033076Z x_sign = torch.sign(x) 2025-05-07T20:31:44.7033367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.7033675Z x = x_sign * x_clamp 2025-05-07T20:31:44.7033914Z x0 = x[:, :D] 2025-05-07T20:31:44.7034129Z x1 = x[:, D:] 2025-05-07T20:31:44.7034329Z 2025-05-07T20:31:44.7034515Z if contiguous: 2025-05-07T20:31:44.7034743Z x0 = x0.contiguous() 2025-05-07T20:31:44.7035006Z x1 = x1.contiguous() 2025-05-07T20:31:44.7035239Z 2025-05-07T20:31:44.7035428Z if scale_ub is not None: 2025-05-07T20:31:44.7035697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.7036024Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.7036335Z ) 2025-05-07T20:31:44.7036531Z else: 2025-05-07T20:31:44.7036739Z scale_ub_tensor = None 2025-05-07T20:31:44.7036986Z 2025-05-07T20:31:44.7037220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.7037528Z op = silu_mul_quant 2025-05-07T20:31:44.7037778Z if compiled: 2025-05-07T20:31:44.7038033Z op = torch.compile(op) 2025-05-07T20:31:44.7038328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7038600Z 2025-05-07T20:31:44.7038792Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.7038953Z 2025-05-07T20:31:44.7039053Z moe/activation_test.py:117: 2025-05-07T20:31:44.7039346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7039676Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.7039958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.7040506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.7041067Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.7041728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.7042461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.7042994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.7043677Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.7044332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.7044852Z kernel = self.compile( 2025-05-07T20:31:44.7045387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.7046038Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.7046512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.7046740Z 2025-05-07T20:31:44.7046946Z self = 2025-05-07T20:31:44.7048029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.7049481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a404c0>} 2025-05-07T20:31:44.7050828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.7051880Z context = 2025-05-07T20:31:44.7052202Z 2025-05-07T20:31:44.7052369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.7052898Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.7053363Z module_map=module_map) 2025-05-07T20:31:44.7053728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.7054080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.7054344Z E ^ 2025-05-07T20:31:44.7054810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.7055265Z 2025-05-07T20:31:44.7055683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.7056195Z 2025-05-07T20:31:44.9728251Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9728803Z self=, 2025-05-07T20:31:44.9729401Z T=1, 2025-05-07T20:31:44.9729658Z D=5120, 2025-05-07T20:31:44.9729920Z scale_ub=None, 2025-05-07T20:31:44.9730212Z contiguous=False, 2025-05-07T20:31:44.9730474Z compiled=False, 2025-05-07T20:31:44.9730685Z ) 2025-05-07T20:31:44.9731035Z self = 2025-05-07T20:31:44.9731530Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.9731803Z 2025-05-07T20:31:44.9731882Z @given( 2025-05-07T20:31:44.9732135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9732458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9732762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9733096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9733429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9733713Z ) 2025-05-07T20:31:44.9734062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9734503Z def test_silu_mul_quant( 2025-05-07T20:31:44.9734740Z self, 2025-05-07T20:31:44.9734937Z T: int, 2025-05-07T20:31:44.9735137Z D: int, 2025-05-07T20:31:44.9735351Z scale_ub: Optional[float], 2025-05-07T20:31:44.9735625Z contiguous: bool, 2025-05-07T20:31:44.9735865Z compiled: bool, 2025-05-07T20:31:44.9736091Z ) -> None: 2025-05-07T20:31:44.9736303Z torch.manual_seed(2025) 2025-05-07T20:31:44.9736550Z 2025-05-07T20:31:44.9736820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9737157Z 2025-05-07T20:31:44.9737353Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9737647Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9737953Z x = x_sign * x_clamp 2025-05-07T20:31:44.9738194Z x0 = x[:, :D] 2025-05-07T20:31:44.9738413Z x1 = x[:, D:] 2025-05-07T20:31:44.9738810Z 2025-05-07T20:31:44.9739010Z if contiguous: 2025-05-07T20:31:44.9739244Z x0 = x0.contiguous() 2025-05-07T20:31:44.9739501Z x1 = x1.contiguous() 2025-05-07T20:31:44.9739741Z 2025-05-07T20:31:44.9739935Z if scale_ub is not None: 2025-05-07T20:31:44.9740205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9740657Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9740967Z ) 2025-05-07T20:31:44.9741161Z else: 2025-05-07T20:31:44.9741369Z scale_ub_tensor = None 2025-05-07T20:31:44.9741617Z 2025-05-07T20:31:44.9741848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9742158Z op = silu_mul_quant 2025-05-07T20:31:44.9742408Z if compiled: 2025-05-07T20:31:44.9742655Z op = torch.compile(op) 2025-05-07T20:31:44.9742948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9743222Z 2025-05-07T20:31:44.9743422Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9743585Z 2025-05-07T20:31:44.9743685Z moe/activation_test.py:117: 2025-05-07T20:31:44.9743978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9744308Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9744591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9745278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9745965Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9746505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9747176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9747840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9748373Z kernel = self.compile( 2025-05-07T20:31:44.9748911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9749557Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9750048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9750279Z 2025-05-07T20:31:44.9750492Z self = 2025-05-07T20:31:44.9751575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9753005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a40820>} 2025-05-07T20:31:44.9754357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9755379Z context = 2025-05-07T20:31:44.9755668Z 2025-05-07T20:31:44.9755837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9756357Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9756824Z module_map=module_map) 2025-05-07T20:31:44.9757188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9757540Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9757798Z E ^ 2025-05-07T20:31:44.9758266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9758716Z 2025-05-07T20:31:44.9759221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9759730Z 2025-05-07T20:31:44.9759832Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9760245Z self=, 2025-05-07T20:31:44.9760731Z T=4096, 2025-05-07T20:31:44.9760914Z D=7168, 2025-05-07T20:31:44.9761105Z scale_ub=1200.0, 2025-05-07T20:31:44.9761333Z contiguous=False, 2025-05-07T20:31:44.9761554Z compiled=False, 2025-05-07T20:31:44.9761757Z ) 2025-05-07T20:31:44.9762077Z self = 2025-05-07T20:31:44.9762619Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.9762897Z 2025-05-07T20:31:44.9762979Z @given( 2025-05-07T20:31:44.9763211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.9763527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.9763832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.9764164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.9764490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.9764771Z ) 2025-05-07T20:31:44.9765119Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.9765566Z def test_silu_mul_quant( 2025-05-07T20:31:44.9765804Z self, 2025-05-07T20:31:44.9765995Z T: int, 2025-05-07T20:31:44.9766194Z D: int, 2025-05-07T20:31:44.9766403Z scale_ub: Optional[float], 2025-05-07T20:31:44.9766672Z contiguous: bool, 2025-05-07T20:31:44.9766913Z compiled: bool, 2025-05-07T20:31:44.9767133Z ) -> None: 2025-05-07T20:31:44.9767350Z torch.manual_seed(2025) 2025-05-07T20:31:44.9767590Z 2025-05-07T20:31:44.9767859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.9768199Z 2025-05-07T20:31:44.9768393Z x_sign = torch.sign(x) 2025-05-07T20:31:44.9768685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.9768987Z x = x_sign * x_clamp 2025-05-07T20:31:44.9769227Z x0 = x[:, :D] 2025-05-07T20:31:44.9769441Z x1 = x[:, D:] 2025-05-07T20:31:44.9769647Z 2025-05-07T20:31:44.9769831Z if contiguous: 2025-05-07T20:31:44.9770062Z x0 = x0.contiguous() 2025-05-07T20:31:44.9770315Z x1 = x1.contiguous() 2025-05-07T20:31:44.9770555Z 2025-05-07T20:31:44.9770746Z if scale_ub is not None: 2025-05-07T20:31:44.9771018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.9771352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.9771658Z ) 2025-05-07T20:31:44.9771847Z else: 2025-05-07T20:31:44.9772057Z scale_ub_tensor = None 2025-05-07T20:31:44.9772308Z 2025-05-07T20:31:44.9772545Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.9772852Z op = silu_mul_quant 2025-05-07T20:31:44.9773103Z if compiled: 2025-05-07T20:31:44.9773351Z op = torch.compile(op) 2025-05-07T20:31:44.9773644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9773923Z 2025-05-07T20:31:44.9774116Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.9774280Z 2025-05-07T20:31:44.9774381Z moe/activation_test.py:117: 2025-05-07T20:31:44.9774674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9775004Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.9775282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.9775968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.9776653Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.9777265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.9777943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.9778596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.9779198Z kernel = self.compile( 2025-05-07T20:31:44.9779728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.9780379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.9780773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.9780999Z 2025-05-07T20:31:44.9781210Z self = 2025-05-07T20:31:44.9782349Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.9783726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f8faf0>} 2025-05-07T20:31:44.9785072Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.9786093Z context = 2025-05-07T20:31:44.9786379Z 2025-05-07T20:31:44.9786549Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.9787065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.9787529Z module_map=module_map) 2025-05-07T20:31:44.9787896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.9788246Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.9788506Z E ^ 2025-05-07T20:31:44.9788972Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.9789424Z 2025-05-07T20:31:44.9789920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.9790429Z 2025-05-07T20:31:44.9790530Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.9790944Z self=, 2025-05-07T20:31:44.9791353Z T=16384, 2025-05-07T20:31:44.9791541Z D=7168, 2025-05-07T20:31:44.9791736Z scale_ub=None, 2025-05-07T20:31:44.9791951Z contiguous=True, 2025-05-07T20:31:44.9792175Z compiled=True, 2025-05-07T20:31:44.9792399Z ) 2025-05-07T20:31:45.0966379Z self = 2025-05-07T20:31:45.0967125Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.0967518Z 2025-05-07T20:31:45.0967628Z @given( 2025-05-07T20:31:45.0967930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.0968364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.0968685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.0969012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.0969347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.0969636Z ) 2025-05-07T20:31:45.0969988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.0970426Z def test_silu_mul_quant( 2025-05-07T20:31:45.0970672Z self, 2025-05-07T20:31:45.0970868Z T: int, 2025-05-07T20:31:45.0971061Z D: int, 2025-05-07T20:31:45.0971278Z scale_ub: Optional[float], 2025-05-07T20:31:45.0971722Z contiguous: bool, 2025-05-07T20:31:45.0971965Z compiled: bool, 2025-05-07T20:31:45.0972192Z ) -> None: 2025-05-07T20:31:45.0972410Z torch.manual_seed(2025) 2025-05-07T20:31:45.0972647Z 2025-05-07T20:31:45.0972920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.0973376Z 2025-05-07T20:31:45.0973573Z x_sign = torch.sign(x) 2025-05-07T20:31:45.0973864Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.0974180Z x = x_sign * x_clamp 2025-05-07T20:31:45.0974418Z x0 = x[:, :D] 2025-05-07T20:31:45.0974636Z x1 = x[:, D:] 2025-05-07T20:31:45.0974844Z 2025-05-07T20:31:45.0975025Z if contiguous: 2025-05-07T20:31:45.0975256Z x0 = x0.contiguous() 2025-05-07T20:31:45.0975516Z x1 = x1.contiguous() 2025-05-07T20:31:45.0975761Z 2025-05-07T20:31:45.0975950Z if scale_ub is not None: 2025-05-07T20:31:45.0976237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.0976577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.0976882Z ) 2025-05-07T20:31:45.0977076Z else: 2025-05-07T20:31:45.0977290Z scale_ub_tensor = None 2025-05-07T20:31:45.0977541Z 2025-05-07T20:31:45.0977771Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.0978088Z op = silu_mul_quant 2025-05-07T20:31:45.0984811Z if compiled: 2025-05-07T20:31:45.0985088Z op = torch.compile(op) 2025-05-07T20:31:45.0985409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.0985698Z 2025-05-07T20:31:45.0985904Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.0986079Z 2025-05-07T20:31:45.0986186Z moe/activation_test.py:117: 2025-05-07T20:31:45.0986487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.0986836Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.0987138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.0987711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.0988292Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.0988971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.0989683Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.0990291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.0990994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.0991672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.0992212Z kernel = self.compile( 2025-05-07T20:31:45.0992808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.0993470Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.0993872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.0994104Z 2025-05-07T20:31:45.0994314Z self = 2025-05-07T20:31:45.0995410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.0996812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02e67790>} 2025-05-07T20:31:45.0998277Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.0999311Z context = 2025-05-07T20:31:45.0999598Z 2025-05-07T20:31:45.0999766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1000288Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1000833Z module_map=module_map) 2025-05-07T20:31:45.1001193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1001551Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.1001811Z E ^ 2025-05-07T20:31:45.1002281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1002735Z 2025-05-07T20:31:45.1003157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1003678Z 2025-05-07T20:31:45.1003951Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.1004369Z self=, 2025-05-07T20:31:45.1004772Z T=4096, 2025-05-07T20:31:45.1004954Z D=5120, 2025-05-07T20:31:45.1005142Z scale_ub=None, 2025-05-07T20:31:45.1005370Z contiguous=False, 2025-05-07T20:31:45.1005596Z compiled=True, 2025-05-07T20:31:45.1005799Z ) 2025-05-07T20:31:45.1006115Z self = 2025-05-07T20:31:45.1006603Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.1006879Z 2025-05-07T20:31:45.1006955Z @given( 2025-05-07T20:31:45.1007181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.1007491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.1007798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.1008135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.1008473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.1008756Z ) 2025-05-07T20:31:45.1009107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.1009553Z def test_silu_mul_quant( 2025-05-07T20:31:45.1009794Z self, 2025-05-07T20:31:45.1009987Z T: int, 2025-05-07T20:31:45.1010179Z D: int, 2025-05-07T20:31:45.1010391Z scale_ub: Optional[float], 2025-05-07T20:31:45.1010661Z contiguous: bool, 2025-05-07T20:31:45.1010903Z compiled: bool, 2025-05-07T20:31:45.1011152Z ) -> None: 2025-05-07T20:31:45.1011364Z torch.manual_seed(2025) 2025-05-07T20:31:45.1011607Z 2025-05-07T20:31:45.1011886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.1012265Z 2025-05-07T20:31:45.1012465Z x_sign = torch.sign(x) 2025-05-07T20:31:45.1012763Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.1013070Z x = x_sign * x_clamp 2025-05-07T20:31:45.1013311Z x0 = x[:, :D] 2025-05-07T20:31:45.1013527Z x1 = x[:, D:] 2025-05-07T20:31:45.1013739Z 2025-05-07T20:31:45.1013918Z if contiguous: 2025-05-07T20:31:45.1014154Z x0 = x0.contiguous() 2025-05-07T20:31:45.1014421Z x1 = x1.contiguous() 2025-05-07T20:31:45.1014657Z 2025-05-07T20:31:45.1014851Z if scale_ub is not None: 2025-05-07T20:31:45.1015123Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.1015454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.1015764Z ) 2025-05-07T20:31:45.1015963Z else: 2025-05-07T20:31:45.1016169Z scale_ub_tensor = None 2025-05-07T20:31:45.1016417Z 2025-05-07T20:31:45.1016649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.1016957Z op = silu_mul_quant 2025-05-07T20:31:45.1017342Z if compiled: 2025-05-07T20:31:45.1017594Z op = torch.compile(op) 2025-05-07T20:31:45.1017886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1018160Z 2025-05-07T20:31:45.1018354Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.1018520Z 2025-05-07T20:31:45.1018622Z moe/activation_test.py:117: 2025-05-07T20:31:45.1019024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1019356Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.1019639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.1020197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.1020758Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.1021422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.1022124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.1022716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.1023402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.1024063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.1024604Z kernel = self.compile( 2025-05-07T20:31:45.1025144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.1025798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.1026196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.1026426Z 2025-05-07T20:31:45.1026634Z self = 2025-05-07T20:31:45.1027733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.1029124Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02b5d550>} 2025-05-07T20:31:45.1030537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.1031570Z context = 2025-05-07T20:31:45.1031858Z 2025-05-07T20:31:45.1032027Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.1032550Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.1033023Z module_map=module_map) 2025-05-07T20:31:45.1033384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.1033739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.1033993Z E ^ 2025-05-07T20:31:45.1034455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.1034918Z 2025-05-07T20:31:45.1035336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.1035855Z 2025-05-07T20:31:45.4949534Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4950909Z self=, 2025-05-07T20:31:45.4952000Z T=4096, 2025-05-07T20:31:45.4952397Z D=5120, 2025-05-07T20:31:45.4952626Z scale_ub=1200.0, 2025-05-07T20:31:45.4952854Z contiguous=False, 2025-05-07T20:31:45.4953079Z compiled=False, 2025-05-07T20:31:45.4953283Z ) 2025-05-07T20:31:45.4953781Z self = 2025-05-07T20:31:45.4954292Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4954573Z 2025-05-07T20:31:45.4954651Z @given( 2025-05-07T20:31:45.4954885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4955316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4955628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4955958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4956292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4956582Z ) 2025-05-07T20:31:45.4956930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4957379Z def test_silu_mul_quant( 2025-05-07T20:31:45.4957623Z self, 2025-05-07T20:31:45.4957817Z T: int, 2025-05-07T20:31:45.4958017Z D: int, 2025-05-07T20:31:45.4958250Z scale_ub: Optional[float], 2025-05-07T20:31:45.4958523Z contiguous: bool, 2025-05-07T20:31:45.4958762Z compiled: bool, 2025-05-07T20:31:45.4958993Z ) -> None: 2025-05-07T20:31:45.4959206Z torch.manual_seed(2025) 2025-05-07T20:31:45.4959454Z 2025-05-07T20:31:45.4959728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4960083Z 2025-05-07T20:31:45.4960274Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4960568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4960879Z x = x_sign * x_clamp 2025-05-07T20:31:45.4961115Z x0 = x[:, :D] 2025-05-07T20:31:45.4961330Z x1 = x[:, D:] 2025-05-07T20:31:45.4961544Z 2025-05-07T20:31:45.4961729Z if contiguous: 2025-05-07T20:31:45.4961964Z x0 = x0.contiguous() 2025-05-07T20:31:45.4962223Z x1 = x1.contiguous() 2025-05-07T20:31:45.4962460Z 2025-05-07T20:31:45.4962652Z if scale_ub is not None: 2025-05-07T20:31:45.4962933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4963266Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4963575Z ) 2025-05-07T20:31:45.4963774Z else: 2025-05-07T20:31:45.4963982Z scale_ub_tensor = None 2025-05-07T20:31:45.4964243Z 2025-05-07T20:31:45.4964474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4964794Z op = silu_mul_quant 2025-05-07T20:31:45.4965042Z if compiled: 2025-05-07T20:31:45.4965300Z op = torch.compile(op) 2025-05-07T20:31:45.4965602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4965873Z 2025-05-07T20:31:45.4966067Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4966234Z 2025-05-07T20:31:45.4966343Z moe/activation_test.py:117: 2025-05-07T20:31:45.4966636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4966973Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4967252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4967941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4968641Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4969195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4969881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4970537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4971079Z kernel = self.compile( 2025-05-07T20:31:45.4971627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4972289Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4972787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4973021Z 2025-05-07T20:31:45.4973230Z self = 2025-05-07T20:31:45.4974336Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4975839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebd0d0>} 2025-05-07T20:31:45.4977206Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4978258Z context = 2025-05-07T20:31:45.4978557Z 2025-05-07T20:31:45.4978726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4979254Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4979726Z module_map=module_map) 2025-05-07T20:31:45.4980109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4980469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4980733Z E ^ 2025-05-07T20:31:45.4981209Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4981670Z 2025-05-07T20:31:45.4982091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4982603Z 2025-05-07T20:31:45.4982710Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4983122Z self=, 2025-05-07T20:31:45.4983528Z T=4096, 2025-05-07T20:31:45.4983718Z D=5120, 2025-05-07T20:31:45.4983908Z scale_ub=1200.0, 2025-05-07T20:31:45.4984137Z contiguous=False, 2025-05-07T20:31:45.4984358Z compiled=True, 2025-05-07T20:31:45.4984564Z ) 2025-05-07T20:31:45.4984880Z self = 2025-05-07T20:31:45.4985369Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4985642Z 2025-05-07T20:31:45.4985727Z @given( 2025-05-07T20:31:45.4985952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4986261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4986566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4986887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4987219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4987507Z ) 2025-05-07T20:31:45.4987849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4988293Z def test_silu_mul_quant( 2025-05-07T20:31:45.4988534Z self, 2025-05-07T20:31:45.4988731Z T: int, 2025-05-07T20:31:45.4988928Z D: int, 2025-05-07T20:31:45.4989146Z scale_ub: Optional[float], 2025-05-07T20:31:45.4989418Z contiguous: bool, 2025-05-07T20:31:45.4989656Z compiled: bool, 2025-05-07T20:31:45.4989950Z ) -> None: 2025-05-07T20:31:45.4990162Z torch.manual_seed(2025) 2025-05-07T20:31:45.4990396Z 2025-05-07T20:31:45.4990667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4991008Z 2025-05-07T20:31:45.4991199Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4991488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4991795Z x = x_sign * x_clamp 2025-05-07T20:31:45.4992033Z x0 = x[:, :D] 2025-05-07T20:31:45.4992334Z x1 = x[:, D:] 2025-05-07T20:31:45.4992547Z 2025-05-07T20:31:45.4992744Z if contiguous: 2025-05-07T20:31:45.4992978Z x0 = x0.contiguous() 2025-05-07T20:31:45.4993234Z x1 = x1.contiguous() 2025-05-07T20:31:45.4993474Z 2025-05-07T20:31:45.4993661Z if scale_ub is not None: 2025-05-07T20:31:45.4994013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4994342Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4994652Z ) 2025-05-07T20:31:45.4994847Z else: 2025-05-07T20:31:45.4995057Z scale_ub_tensor = None 2025-05-07T20:31:45.4995308Z 2025-05-07T20:31:45.4995536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4995846Z op = silu_mul_quant 2025-05-07T20:31:45.4996099Z if compiled: 2025-05-07T20:31:45.4996343Z op = torch.compile(op) 2025-05-07T20:31:45.4996646Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4996919Z 2025-05-07T20:31:45.4997107Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4997275Z 2025-05-07T20:31:45.4997375Z moe/activation_test.py:117: 2025-05-07T20:31:45.4997667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4998003Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4998279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4998831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4999387Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.5000037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.5000720Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.5001251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.5001931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.5002582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.5003114Z kernel = self.compile( 2025-05-07T20:31:45.5003654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.5004475Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.5004869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.5005103Z 2025-05-07T20:31:45.5005309Z self = 2025-05-07T20:31:45.5006392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.5007760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebddc0>} 2025-05-07T20:31:45.5009101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.5010124Z context = 2025-05-07T20:31:45.5010409Z 2025-05-07T20:31:45.5010578Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.5011100Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.5011560Z module_map=module_map) 2025-05-07T20:31:45.5011927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.5012407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.5012667Z E ^ 2025-05-07T20:31:45.5013134Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.5013581Z 2025-05-07T20:31:45.5014000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.5014617Z 2025-05-07T20:31:45.7773702Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.7774286Z self=, 2025-05-07T20:31:45.7774878Z T=2048, 2025-05-07T20:31:45.7775123Z D=7168, 2025-05-07T20:31:45.7775370Z scale_ub=1200.0, 2025-05-07T20:31:45.7775669Z contiguous=False, 2025-05-07T20:31:45.7775962Z compiled=False, 2025-05-07T20:31:45.7776224Z ) 2025-05-07T20:31:45.7776643Z self = 2025-05-07T20:31:45.7777150Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.7777430Z 2025-05-07T20:31:45.7777510Z @given( 2025-05-07T20:31:45.7777737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.7778048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.7778350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.7778683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.7779013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.7779292Z ) 2025-05-07T20:31:45.7779640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.7780080Z def test_silu_mul_quant( 2025-05-07T20:31:45.7780318Z self, 2025-05-07T20:31:45.7780509Z T: int, 2025-05-07T20:31:45.7780700Z D: int, 2025-05-07T20:31:45.7780915Z scale_ub: Optional[float], 2025-05-07T20:31:45.7781186Z contiguous: bool, 2025-05-07T20:31:45.7781424Z compiled: bool, 2025-05-07T20:31:45.7781647Z ) -> None: 2025-05-07T20:31:45.7781868Z torch.manual_seed(2025) 2025-05-07T20:31:45.7782107Z 2025-05-07T20:31:45.7782371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.7782711Z 2025-05-07T20:31:45.7782903Z x_sign = torch.sign(x) 2025-05-07T20:31:45.7783198Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.7783501Z x = x_sign * x_clamp 2025-05-07T20:31:45.7783740Z x0 = x[:, :D] 2025-05-07T20:31:45.7783953Z x1 = x[:, D:] 2025-05-07T20:31:45.7784153Z 2025-05-07T20:31:45.7784339Z if contiguous: 2025-05-07T20:31:45.7784571Z x0 = x0.contiguous() 2025-05-07T20:31:45.7784824Z x1 = x1.contiguous() 2025-05-07T20:31:45.7785062Z 2025-05-07T20:31:45.7785251Z if scale_ub is not None: 2025-05-07T20:31:45.7785519Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.7785859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.7786169Z ) 2025-05-07T20:31:45.7786358Z else: 2025-05-07T20:31:45.7786568Z scale_ub_tensor = None 2025-05-07T20:31:45.7786822Z 2025-05-07T20:31:45.7787047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.7787364Z op = silu_mul_quant 2025-05-07T20:31:45.7787616Z if compiled: 2025-05-07T20:31:45.7787860Z op = torch.compile(op) 2025-05-07T20:31:45.7788150Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.7788426Z 2025-05-07T20:31:45.7788616Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.7788783Z 2025-05-07T20:31:45.7788883Z moe/activation_test.py:117: 2025-05-07T20:31:45.7789177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.7789512Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.7789786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.7790720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.7791415Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.7791952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.7792731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.7793387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.7793911Z kernel = self.compile( 2025-05-07T20:31:45.7794440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.7795089Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.7795481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.7795709Z 2025-05-07T20:31:45.7795930Z self = 2025-05-07T20:31:45.7797009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.7798387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa029de670>} 2025-05-07T20:31:45.7799734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.7800754Z context = 2025-05-07T20:31:45.7801038Z 2025-05-07T20:31:45.7801211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.7801725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.7802190Z module_map=module_map) 2025-05-07T20:31:45.7802551Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.7802899Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.7803160Z E ^ 2025-05-07T20:31:45.7803629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.7804781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.7805395Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError, traceback identical to the one above
2025-05-07T20:31:45.7842728Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:45.9774846Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
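All of these failures share a single root cause: the Triton front end rejects the fp8e4nv element type (PyTorch's float8_e4m3fn) while lowering _fbgemm_silu_mul_quant, and on this GPU offers only fp8e4b15/fp8e5. As far as I know, fp8e4nv lowering is tied to SM 8.9+ (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports compute capability 8.6. A minimal sketch of a capability guard that would skip the whole class once instead of failing every drawn example (the helper and class names are illustrative, not from the test file):

import unittest

import torch


def _fp8e4nv_supported() -> bool:
    # Assumption: Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+
    # (Ada/Hopper); the A10G on this runner reports (8, 6), which is why
    # Triton advertises only fp8e4b15/fp8e5 here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class, this turns the repeated per-example compile
# failures into a single skip on unsupported GPUs.
@unittest.skipUnless(_fp8e4nv_supported(), "fp8e4nv requires SM 8.9+; skipping FP8 MoE tests")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant as listed in the traceback above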
2025-05-07T20:31:45.9805603Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.1027706Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.3122487Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
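Because the error is raised at kernel-compile time, every Hypothesis example is doomed the same way, so a standalone repro is faster for debugging than re-running the property test. A sketch, assuming silu_mul_quant is importable from the module path shown in the traceback, replaying one failing example's parameters:

# Minimal standalone repro (outside Hypothesis) of one failing example;
# parameters T=2048, D=7168 match the first example above.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

torch.manual_seed(2025)
T, D = 2048, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without fp8e4nv support this raises
# triton.compiler.errors.CompilationError at kernel-compile time.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)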
2025-05-07T20:31:46.3153864Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:31:46.7415331Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.7447938Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:46.8692053Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
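For orientation while reading these tracebacks: the op under test fuses SiLU(x0) * x1 with FP8 quantization and returns the quantized tensor plus a scale. A rough eager-mode reference, assuming row-wise float8_e4m3fn scaling with an optional scale upper bound; the actual FBGEMM kernel's scaling scheme may differ:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: SiLU(x0) * x1 in fp32, then row-wise quantization
    # to float8_e4m3fn such that y ~= y_fp8.float() * scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # Matches the test's optional scale_ub_tensor argument.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale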
2025-05-07T20:31:47.0983612Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError:
2025-05-07T20:31:47.1013628Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:47.1013982Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:47.1014236Z E       ^
2025-05-07T20:31:47.1014704Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:47.1015591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

[... Hypothesis then tries seven more examples, and every one fails with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised while compiling _fbgemm_silu_mul_quant and surfacing at moe/activation_test.py:117. The repeated test listing and traceback are verbatim copies of the example above (modulo object addresses) and are elided here; only the example parameters differ: ...]

2025-05-07T20:31:47.1016225Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:47.2373524Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:47.7221192Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:47.7263415Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:47.7293687Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:47.8497099Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:48.0256010Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
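[Editor's note on the repeated CompilationError: Triton rejects the fp8e4nv (e4m3) element type at kernel-compile time because this runner's GPU predates hardware FP8 support. The g5 runner's A10G is SM 8.6, while Triton accepts fp8e4nv only on SM 8.9+ (Ada/Hopper), which matches the error's claim that only fp8e4b15 and fp8e5 are available here. A minimal sketch of a capability guard a suite like this could use to skip rather than fail on such hardware; the helper and marker names are illustrative, not part of activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) compiles in Triton only on NVIDIA SM 8.9+;
        # older parts such as the A10G (SM 8.6) only expose
        # fp8e4b15 and fp8e5, exactly as the error message reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant, this would skip the test up front
    # instead of failing every Hypothesis example the same way.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv (e4m3) requires compute capability >= 8.9",
    )

Skipping at collection time would also avoid the memory pressure visible in the examples that follow.]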
2025-05-07T20:31:48.0293981Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

[... test body identical to the listing above; this example fails earlier, while building its bfloat16 inputs: ...]

2025-05-07T20:31:48.0302540Z x_sign = torch.sign(x)
2025-05-07T20:31:48.0302823Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:48.0305045Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.0307060Z moe/activation_test.py:95: OutOfMemoryError

[... four more examples hit torch.OutOfMemoryError while building their inputs, with the same allocator message as above; only the requested size and the failing line differ: ...]

2025-05-07T20:31:48.0307380Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError allocating 112.00 MiB at x_clamp (moe/activation_test.py:95)
2025-05-07T20:31:48.0321881Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- OutOfMemoryError allocating 448.00 MiB at torch.randn (moe/activation_test.py:92)
2025-05-07T20:31:48.1372730Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError allocating 56.00 MiB at x_clamp (moe/activation_test.py:95)
2025-05-07T20:31:48.1385953Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- OutOfMemoryError allocating 56.00 MiB at x_sign (moe/activation_test.py:94)
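[Editor's note on the OutOfMemoryError run: these failures look like a knock-on effect of the compilation failures above. Hypothesis keeps the worker process alive across examples, so the large bfloat16 inputs from earlier examples stay cached by PyTorch's allocator until even a 56 MiB request fails on the 22 GiB device. The allocator message itself suggests one mitigation; a sketch of that plus an explicit cache release between examples, with an illustrative helper name:

    import gc
    import os

    # Suggested by the allocator message; must be set before the process
    # makes its first CUDA allocation in order to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_gpu_memory() -> None:
        # Drop dead Python references, wait for in-flight work, then
        # return cached allocator blocks so the next Hypothesis example
        # starts from a clean memory pool.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Calling such a helper from a per-example teardown would reduce the fragmentation-driven OOMs, though the underlying CompilationError would remain.]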
2025-05-07T20:31:48.1398834Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

[... with T=1 the inputs fit, so this example reaches the kernel launch and fails like the earlier ones, at moe/activation_test.py:117: triton.compiler.errors.CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

2025-05-07T20:31:48.2998457Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

[... the log section is truncated partway through this example's traceback, at the silu_mul_quant frame in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 ...]
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.3015704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.3016370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.3017020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.3017536Z kernel = self.compile( 2025-05-07T20:31:48.3018064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.3018709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.3019100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3019327Z 2025-05-07T20:31:48.3019533Z self = 2025-05-07T20:31:48.3020605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.3022090Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227040>} 2025-05-07T20:31:48.3023482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.3024492Z context = 2025-05-07T20:31:48.3024849Z 2025-05-07T20:31:48.3025016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.3025523Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.3025983Z module_map=module_map) 2025-05-07T20:31:48.3026340Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.3026681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.3026932Z E ^ 2025-05-07T20:31:48.3027400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.3027844Z 2025-05-07T20:31:48.3028259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.3028762Z 2025-05-07T20:31:48.3028860Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3029266Z self=, 2025-05-07T20:31:48.3029664Z T=128, 2025-05-07T20:31:48.3029887Z D=7168, 2025-05-07T20:31:48.3030070Z scale_ub=None, 2025-05-07T20:31:48.3030273Z contiguous=True, 2025-05-07T20:31:48.3030485Z compiled=False, 2025-05-07T20:31:48.3030688Z ) 2025-05-07T20:31:48.3922533Z self = 2025-05-07T20:31:48.3923049Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.3923359Z 2025-05-07T20:31:48.3923447Z @given( 2025-05-07T20:31:48.3923674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.3923979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.3924279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.3924600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.3924919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.3925203Z ) 2025-05-07T20:31:48.3925540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.3925969Z def test_silu_mul_quant( 2025-05-07T20:31:48.3926208Z self, 2025-05-07T20:31:48.3926391Z T: int, 2025-05-07T20:31:48.3926582Z D: int, 2025-05-07T20:31:48.3926794Z scale_ub: Optional[float], 2025-05-07T20:31:48.3927060Z contiguous: bool, 2025-05-07T20:31:48.3927284Z compiled: bool, 2025-05-07T20:31:48.3927502Z ) -> None: 2025-05-07T20:31:48.3927708Z torch.manual_seed(2025) 2025-05-07T20:31:48.3927937Z 2025-05-07T20:31:48.3928207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.3928541Z 2025-05-07T20:31:48.3928722Z x_sign = torch.sign(x) 2025-05-07T20:31:48.3929006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.3929306Z x = x_sign * x_clamp 2025-05-07T20:31:48.3929533Z x0 = x[:, :D] 2025-05-07T20:31:48.3929749Z x1 = x[:, D:] 2025-05-07T20:31:48.3929969Z 2025-05-07T20:31:48.3930145Z if contiguous: 2025-05-07T20:31:48.3930363Z x0 = x0.contiguous() 2025-05-07T20:31:48.3930620Z x1 = x1.contiguous() 2025-05-07T20:31:48.3930853Z 2025-05-07T20:31:48.3931034Z if scale_ub is not None: 2025-05-07T20:31:48.3931294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.3938082Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.3938440Z ) 2025-05-07T20:31:48.3938626Z else: 2025-05-07T20:31:48.3938834Z scale_ub_tensor = None 2025-05-07T20:31:48.3939258Z 2025-05-07T20:31:48.3939490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.3939804Z op = silu_mul_quant 2025-05-07T20:31:48.3940061Z if compiled: 2025-05-07T20:31:48.3940305Z op = torch.compile(op) 2025-05-07T20:31:48.3940607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.3940992Z 2025-05-07T20:31:48.3941177Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.3941346Z 2025-05-07T20:31:48.3941446Z moe/activation_test.py:117: 2025-05-07T20:31:48.3941740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3942072Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.3942346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.3943109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.3943814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.3944366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.3945057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.3945731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.3946274Z kernel = self.compile( 2025-05-07T20:31:48.3946819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.3947482Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.3947884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.3948112Z 2025-05-07T20:31:48.3948327Z self = 2025-05-07T20:31:48.3949436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.3950926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227c10>} 2025-05-07T20:31:48.3952301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.3953390Z context = 2025-05-07T20:31:48.3953683Z 2025-05-07T20:31:48.3953854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.3954379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.3954852Z module_map=module_map) 2025-05-07T20:31:48.3955215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.3955560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.3955818Z E ^ 2025-05-07T20:31:48.3956286Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.3956748Z 2025-05-07T20:31:48.3957179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.3957702Z 2025-05-07T20:31:48.3957801Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3958226Z self=, 2025-05-07T20:31:48.3958634Z T=2048, 2025-05-07T20:31:48.3958811Z D=7168, 2025-05-07T20:31:48.3958993Z scale_ub=1200.0, 2025-05-07T20:31:48.3959205Z contiguous=True, 2025-05-07T20:31:48.3959419Z compiled=False, 2025-05-07T20:31:48.3959621Z ) 2025-05-07T20:31:48.3960028Z self = 2025-05-07T20:31:48.3960524Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.3960810Z 2025-05-07T20:31:48.3960884Z @given( 2025-05-07T20:31:48.3961111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.3961522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.3961831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.3962164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.3962494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.3962774Z ) 2025-05-07T20:31:48.3963120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.3963562Z def test_silu_mul_quant( 2025-05-07T20:31:48.3963796Z self, 2025-05-07T20:31:48.3963986Z T: int, 2025-05-07T20:31:48.3964174Z D: int, 2025-05-07T20:31:48.3964394Z scale_ub: Optional[float], 2025-05-07T20:31:48.3964663Z contiguous: bool, 2025-05-07T20:31:48.3964899Z compiled: bool, 2025-05-07T20:31:48.3965125Z ) -> None: 2025-05-07T20:31:48.3965330Z torch.manual_seed(2025) 2025-05-07T20:31:48.3965563Z 2025-05-07T20:31:48.3965831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.3967953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
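[editor's note] The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. That variable is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation in the process; in CI the simplest place is the workflow step's environment. A sketch of the in-process alternative:

    # Must run before torch first touches the GPU in this process.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch  # the import and all .cuda allocations come after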
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.3969873Z 2025-05-07T20:31:48.3969998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.3970211Z 2025-05-07T20:31:48.3970312Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.3970724Z self=, 2025-05-07T20:31:48.3971126Z T=1, 2025-05-07T20:31:48.3971301Z D=5120, 2025-05-07T20:31:48.3971487Z scale_ub=1200.0, 2025-05-07T20:31:48.3971703Z contiguous=True, 2025-05-07T20:31:48.3971918Z compiled=False, 2025-05-07T20:31:48.3972112Z ) 2025-05-07T20:31:48.4455902Z self = 2025-05-07T20:31:48.4456409Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.4456704Z 2025-05-07T20:31:48.4456799Z @given( 2025-05-07T20:31:48.4457125Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4457548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4457903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4458236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4458562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4458841Z ) 2025-05-07T20:31:48.4459190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4459638Z def test_silu_mul_quant( 2025-05-07T20:31:48.4459875Z self, 2025-05-07T20:31:48.4460067Z T: int, 2025-05-07T20:31:48.4460271Z D: int, 2025-05-07T20:31:48.4460485Z scale_ub: Optional[float], 2025-05-07T20:31:48.4460765Z contiguous: bool, 2025-05-07T20:31:48.4461004Z compiled: bool, 2025-05-07T20:31:48.4461223Z ) -> None: 2025-05-07T20:31:48.4461438Z torch.manual_seed(2025) 2025-05-07T20:31:48.4461682Z 2025-05-07T20:31:48.4461951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4462290Z 2025-05-07T20:31:48.4462642Z x_sign = torch.sign(x) 2025-05-07T20:31:48.4462941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:48.4463284Z x = x_sign * x_clamp 2025-05-07T20:31:48.4463536Z x0 = x[:, :D] 2025-05-07T20:31:48.4463754Z x1 = x[:, D:] 2025-05-07T20:31:48.4463961Z 2025-05-07T20:31:48.4464156Z if contiguous: 2025-05-07T20:31:48.4464512Z x0 = x0.contiguous() 2025-05-07T20:31:48.4464770Z x1 = x1.contiguous() 2025-05-07T20:31:48.4465013Z 2025-05-07T20:31:48.4465204Z if scale_ub is not None: 2025-05-07T20:31:48.4465469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:48.4465809Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:48.4466121Z ) 2025-05-07T20:31:48.4466309Z else: 2025-05-07T20:31:48.4466522Z scale_ub_tensor = None 2025-05-07T20:31:48.4466771Z 2025-05-07T20:31:48.4466999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:48.4467316Z op = silu_mul_quant 2025-05-07T20:31:48.4467568Z if compiled: 2025-05-07T20:31:48.4467811Z op = torch.compile(op) 2025-05-07T20:31:48.4468103Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.4468377Z 2025-05-07T20:31:48.4468573Z > y_fp8, y_scale = fn() 2025-05-07T20:31:48.4468744Z 2025-05-07T20:31:48.4468842Z moe/activation_test.py:117: 2025-05-07T20:31:48.4469139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.4469478Z moe/activation_test.py:115: in fn 2025-05-07T20:31:48.4469756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:48.4470526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:48.4471212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:48.4471744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:48.4472424Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:48.4473090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:48.4473619Z kernel = self.compile( 2025-05-07T20:31:48.4474163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:48.4474813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:48.4475207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:48.4475433Z 2025-05-07T20:31:48.4475644Z self = 2025-05-07T20:31:48.4476720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:48.4478103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021ad9d0>} 2025-05-07T20:31:48.4479448Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:48.4480471Z context = 2025-05-07T20:31:48.4480757Z 2025-05-07T20:31:48.4480928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:48.4481446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:48.4481910Z module_map=module_map) 2025-05-07T20:31:48.4482273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:48.4482706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:48.4482969Z E ^ 2025-05-07T20:31:48.4483436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:48.4483885Z 2025-05-07T20:31:48.4484302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:48.4484910Z 2025-05-07T20:31:48.4485014Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4485425Z self=, 2025-05-07T20:31:48.4485827Z T=2048, 2025-05-07T20:31:48.4486013Z D=5120, 2025-05-07T20:31:48.4486203Z scale_ub=None, 2025-05-07T20:31:48.4486416Z contiguous=True, 2025-05-07T20:31:48.4486661Z compiled=False, 2025-05-07T20:31:48.4486864Z ) 2025-05-07T20:31:48.4487178Z self = 2025-05-07T20:31:48.4487675Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.4487944Z 2025-05-07T20:31:48.4488034Z @given( 2025-05-07T20:31:48.4488257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4488569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4488873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4489208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4489548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4489829Z ) 2025-05-07T20:31:48.4490173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4490619Z def test_silu_mul_quant( 2025-05-07T20:31:48.4490860Z self, 2025-05-07T20:31:48.4491055Z T: int, 2025-05-07T20:31:48.4491250Z D: int, 2025-05-07T20:31:48.4491465Z scale_ub: Optional[float], 2025-05-07T20:31:48.4491734Z contiguous: bool, 2025-05-07T20:31:48.4491974Z compiled: bool, 2025-05-07T20:31:48.4492198Z ) -> None: 2025-05-07T20:31:48.4492417Z torch.manual_seed(2025) 2025-05-07T20:31:48.4492650Z 2025-05-07T20:31:48.4492922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4493261Z 2025-05-07T20:31:48.4493453Z > x_sign = torch.sign(x) 2025-05-07T20:31:48.4495416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
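[editor's note] The CompilationError blocks interleaved with these OOM failures are a second, independent problem: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner because the g5 instance's A10G GPU is compute capability 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports; fp8e4nv codegen generally needs SM 8.9+ (Ada/Hopper). A sketch of a capability guard that would skip these cases instead of erroring (guard and class names are illustrative):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 Triton kernels generally require SM 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...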
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.4497280Z 2025-05-07T20:31:48.4497406Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:48.4497621Z 2025-05-07T20:31:48.4497723Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4498152Z self=, 2025-05-07T20:31:48.4498551Z T=16384, 2025-05-07T20:31:48.4498744Z D=5120, 2025-05-07T20:31:48.4498945Z scale_ub=None, 2025-05-07T20:31:48.4499155Z contiguous=True, 2025-05-07T20:31:48.4499378Z compiled=False, 2025-05-07T20:31:48.4499580Z ) 2025-05-07T20:31:48.4499895Z self = 2025-05-07T20:31:48.4500386Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.4500660Z 2025-05-07T20:31:48.4500746Z @given( 2025-05-07T20:31:48.4500974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.4501278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.4501585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.4502002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.4502324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.4502607Z ) 2025-05-07T20:31:48.4502954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.4503387Z def test_silu_mul_quant( 2025-05-07T20:31:48.4503878Z self, 2025-05-07T20:31:48.4504074Z T: int, 2025-05-07T20:31:48.4504264Z D: int, 2025-05-07T20:31:48.4504479Z scale_ub: Optional[float], 2025-05-07T20:31:48.4504746Z contiguous: bool, 2025-05-07T20:31:48.4504977Z compiled: bool, 2025-05-07T20:31:48.4505203Z ) -> None: 2025-05-07T20:31:48.4505417Z torch.manual_seed(2025) 2025-05-07T20:31:48.4505653Z 2025-05-07T20:31:48.4505925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.4507980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.4509889Z 2025-05-07T20:31:48.4510009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.4510219Z 2025-05-07T20:31:48.4510325Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.4510732Z self=, 2025-05-07T20:31:48.4511129Z T=4096, 2025-05-07T20:31:48.4511313Z D=5120, 2025-05-07T20:31:48.4511498Z scale_ub=None, 2025-05-07T20:31:48.4511715Z contiguous=True, 2025-05-07T20:31:48.4511936Z compiled=False, 2025-05-07T20:31:48.4512145Z ) 2025-05-07T20:31:48.5550521Z self = 2025-05-07T20:31:48.5551075Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.5551464Z 2025-05-07T20:31:48.5551574Z @given( 2025-05-07T20:31:48.5551888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5552198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5552503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5552830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5553159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5553443Z ) 2025-05-07T20:31:48.5553783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5554230Z def test_silu_mul_quant( 2025-05-07T20:31:48.5554480Z self, 2025-05-07T20:31:48.5554670Z T: int, 2025-05-07T20:31:48.5554874Z D: int, 2025-05-07T20:31:48.5555093Z scale_ub: Optional[float], 2025-05-07T20:31:48.5555362Z contiguous: bool, 2025-05-07T20:31:48.5555606Z compiled: bool, 2025-05-07T20:31:48.5555830Z ) -> None: 2025-05-07T20:31:48.5556040Z torch.manual_seed(2025) 2025-05-07T20:31:48.5556289Z 2025-05-07T20:31:48.5556564Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5558610Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5560617Z 2025-05-07T20:31:48.5560749Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5560964Z 2025-05-07T20:31:48.5561065Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5561484Z self=, 2025-05-07T20:31:48.5562002Z T=2048, 2025-05-07T20:31:48.5562184Z D=5120, 2025-05-07T20:31:48.5562374Z scale_ub=None, 2025-05-07T20:31:48.5562588Z contiguous=False, 2025-05-07T20:31:48.5562813Z compiled=False, 2025-05-07T20:31:48.5563022Z ) 2025-05-07T20:31:48.5563336Z self = 2025-05-07T20:31:48.5563827Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:48.5564109Z 2025-05-07T20:31:48.5564189Z @given( 2025-05-07T20:31:48.5564419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5564741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5565046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5565374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5565699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5565984Z ) 2025-05-07T20:31:48.5566330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5566773Z def test_silu_mul_quant( 2025-05-07T20:31:48.5567012Z self, 2025-05-07T20:31:48.5567207Z T: int, 2025-05-07T20:31:48.5567403Z D: int, 2025-05-07T20:31:48.5567617Z scale_ub: Optional[float], 2025-05-07T20:31:48.5567889Z contiguous: bool, 2025-05-07T20:31:48.5568128Z compiled: bool, 2025-05-07T20:31:48.5568355Z ) -> None: 2025-05-07T20:31:48.5568567Z torch.manual_seed(2025) 2025-05-07T20:31:48.5568808Z 2025-05-07T20:31:48.5569078Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5571135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
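[editor's note] The contiguous parameter in the test body shown above also affects memory pressure: x0 = x[:, :D] and x1 = x[:, D:] are strided views of x, and calling .contiguous() on them materializes copies, adding another 2*T*D bfloat16 elements on an already-full device. A small self-contained illustration:

    # Column slices are strided views; .contiguous() allocates a copy.
    import torch
    x = torch.randn(4, 8)
    x1 = x[:, 4:]
    assert not x1.is_contiguous()
    assert x1.contiguous().is_contiguous()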
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5572992Z 2025-05-07T20:31:48.5573111Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5573321Z 2025-05-07T20:31:48.5573423Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5573839Z self=, 2025-05-07T20:31:48.5574237Z T=4096, 2025-05-07T20:31:48.5574421Z D=7168, 2025-05-07T20:31:48.5574613Z scale_ub=None, 2025-05-07T20:31:48.5574829Z contiguous=True, 2025-05-07T20:31:48.5575052Z compiled=True, 2025-05-07T20:31:48.5575259Z ) 2025-05-07T20:31:48.5575575Z self = 2025-05-07T20:31:48.5576062Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:48.5576333Z 2025-05-07T20:31:48.5576410Z @given( 2025-05-07T20:31:48.5576638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5576954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5577252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5577665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5578023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5578296Z ) 2025-05-07T20:31:48.5578635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5579071Z def test_silu_mul_quant( 2025-05-07T20:31:48.5579311Z self, 2025-05-07T20:31:48.5579586Z T: int, 2025-05-07T20:31:48.5579779Z D: int, 2025-05-07T20:31:48.5579995Z scale_ub: Optional[float], 2025-05-07T20:31:48.5580255Z contiguous: bool, 2025-05-07T20:31:48.5580488Z compiled: bool, 2025-05-07T20:31:48.5580705Z ) -> None: 2025-05-07T20:31:48.5580988Z torch.manual_seed(2025) 2025-05-07T20:31:48.5581227Z 2025-05-07T20:31:48.5581495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5583526Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
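[editor's note] The "Tried to allocate" sizes track the input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so T=4096, D=7168 needs 4096 * 14336 * 2 bytes = 112.00 MiB, matching the failure above. A quick check:

    # Allocation size of x = torch.randn([T, 2*D], dtype=torch.bfloat16).
    def x_mib(t: int, d: int) -> float:
        return t * 2 * d * 2 / 2**20

    assert x_mib(4096, 7168) == 112.0   # the 112.00 MiB failures
    assert x_mib(16384, 7168) == 448.0  # the 448.00 MiB failures
    assert x_mib(2048, 5120) == 40.0    # the 40.00 MiB failures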
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5585367Z 2025-05-07T20:31:48.5585487Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5585692Z 2025-05-07T20:31:48.5585793Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5586206Z self=, 2025-05-07T20:31:48.5586603Z T=2048, 2025-05-07T20:31:48.5586779Z D=5120, 2025-05-07T20:31:48.5586961Z scale_ub=1200.0, 2025-05-07T20:31:48.5587186Z contiguous=False, 2025-05-07T20:31:48.5587402Z compiled=False, 2025-05-07T20:31:48.5587602Z ) 2025-05-07T20:31:48.5587915Z self = 2025-05-07T20:31:48.5588400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:48.5588684Z 2025-05-07T20:31:48.5588759Z @given( 2025-05-07T20:31:48.5588980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5589291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5589587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5590013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5590336Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5590623Z ) 2025-05-07T20:31:48.5590962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5591395Z def test_silu_mul_quant( 2025-05-07T20:31:48.5591629Z self, 2025-05-07T20:31:48.5591821Z T: int, 2025-05-07T20:31:48.5592017Z D: int, 2025-05-07T20:31:48.5592223Z scale_ub: Optional[float], 2025-05-07T20:31:48.5592484Z contiguous: bool, 2025-05-07T20:31:48.5592718Z compiled: bool, 2025-05-07T20:31:48.5592930Z ) -> None: 2025-05-07T20:31:48.5593145Z torch.manual_seed(2025) 2025-05-07T20:31:48.5593381Z 2025-05-07T20:31:48.5593653Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5595670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5597517Z 2025-05-07T20:31:48.5597634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5597847Z 2025-05-07T20:31:48.5597948Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5598354Z self=, 2025-05-07T20:31:48.5598748Z T=4096, 2025-05-07T20:31:48.5598935Z D=7168, 2025-05-07T20:31:48.5599233Z scale_ub=1200.0, 2025-05-07T20:31:48.5599454Z contiguous=True, 2025-05-07T20:31:48.5599664Z compiled=False, 2025-05-07T20:31:48.5599859Z ) 2025-05-07T20:31:48.5600167Z self = 2025-05-07T20:31:48.5600646Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.5600996Z 2025-05-07T20:31:48.5601073Z @given( 2025-05-07T20:31:48.5601296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.5601598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.5601895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.5602217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.5602539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.5602819Z ) 2025-05-07T20:31:48.5603157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.5603595Z def test_silu_mul_quant( 2025-05-07T20:31:48.5604096Z self, 2025-05-07T20:31:48.5604284Z T: int, 2025-05-07T20:31:48.5604482Z D: int, 2025-05-07T20:31:48.5604693Z scale_ub: Optional[float], 2025-05-07T20:31:48.5604958Z contiguous: bool, 2025-05-07T20:31:48.5605202Z compiled: bool, 2025-05-07T20:31:48.5605437Z ) -> None: 2025-05-07T20:31:48.5612221Z torch.manual_seed(2025) 2025-05-07T20:31:48.5612493Z 2025-05-07T20:31:48.5612769Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.5614817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
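[editor's note] Each drawn example is logged here because the suite runs Hypothesis at verbosity=Verbosity.verbose; deadline=None disables per-example time limits, and max_examples=_MAX_SAMPLES caps how many parameter combinations are drawn. A stripped-down sketch of that harness with the imports it needs (_MAX_SAMPLES below is a placeholder for the constant defined in the test module):

    from hypothesis import Verbosity, given, settings, strategies as st

    _MAX_SAMPLES = 10  # placeholder for the test module's own value

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_sketch(T: int) -> None:
        ...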
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.5616683Z 2025-05-07T20:31:48.5616801Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.5617013Z 2025-05-07T20:31:48.5617125Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.5617531Z self=, 2025-05-07T20:31:48.5617926Z T=16384, 2025-05-07T20:31:48.5618121Z D=7168, 2025-05-07T20:31:48.5618372Z scale_ub=None, 2025-05-07T20:31:48.5618695Z contiguous=False, 2025-05-07T20:31:48.5618978Z compiled=True, 2025-05-07T20:31:48.5619243Z ) 2025-05-07T20:31:48.6913358Z self = 2025-05-07T20:31:48.6913900Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:48.6914293Z 2025-05-07T20:31:48.6914401Z @given( 2025-05-07T20:31:48.6914666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6914972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6915278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6915609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6915939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6916227Z ) 2025-05-07T20:31:48.6916575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6917008Z def test_silu_mul_quant( 2025-05-07T20:31:48.6917252Z self, 2025-05-07T20:31:48.6917447Z T: int, 2025-05-07T20:31:48.6917639Z D: int, 2025-05-07T20:31:48.6917856Z scale_ub: Optional[float], 2025-05-07T20:31:48.6918127Z contiguous: bool, 2025-05-07T20:31:48.6918359Z compiled: bool, 2025-05-07T20:31:48.6918591Z ) -> None: 2025-05-07T20:31:48.6918809Z torch.manual_seed(2025) 2025-05-07T20:31:48.6919223Z 2025-05-07T20:31:48.6919495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6921541Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6923564Z 2025-05-07T20:31:48.6923684Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6923894Z 2025-05-07T20:31:48.6924002Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6924418Z self=, 2025-05-07T20:31:48.6924827Z T=4096, 2025-05-07T20:31:48.6925013Z D=7168, 2025-05-07T20:31:48.6925200Z scale_ub=None, 2025-05-07T20:31:48.6925408Z contiguous=True, 2025-05-07T20:31:48.6925627Z compiled=False, 2025-05-07T20:31:48.6925832Z ) 2025-05-07T20:31:48.6926140Z self = 2025-05-07T20:31:48.6926635Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6926903Z 2025-05-07T20:31:48.6926990Z @given( 2025-05-07T20:31:48.6927211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6927518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6927825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6928149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6928474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6928753Z ) 2025-05-07T20:31:48.6929100Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6929533Z def test_silu_mul_quant( 2025-05-07T20:31:48.6929773Z self, 2025-05-07T20:31:48.6929965Z T: int, 2025-05-07T20:31:48.6930161Z D: int, 2025-05-07T20:31:48.6930377Z scale_ub: Optional[float], 2025-05-07T20:31:48.6930657Z contiguous: bool, 2025-05-07T20:31:48.6930891Z compiled: bool, 2025-05-07T20:31:48.6931114Z ) -> None: 2025-05-07T20:31:48.6931325Z torch.manual_seed(2025) 2025-05-07T20:31:48.6931567Z 2025-05-07T20:31:48.6931831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6933864Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6935716Z 2025-05-07T20:31:48.6935833Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6936046Z 2025-05-07T20:31:48.6936152Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6936560Z self=, 2025-05-07T20:31:48.6936957Z T=16384, 2025-05-07T20:31:48.6937143Z D=7168, 2025-05-07T20:31:48.6937333Z scale_ub=None, 2025-05-07T20:31:48.6937541Z contiguous=True, 2025-05-07T20:31:48.6937764Z compiled=False, 2025-05-07T20:31:48.6937964Z ) 2025-05-07T20:31:48.6938270Z self = 2025-05-07T20:31:48.6938754Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:48.6939110Z 2025-05-07T20:31:48.6939191Z @given( 2025-05-07T20:31:48.6939411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6939721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6940021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6940414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6940739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6941023Z ) 2025-05-07T20:31:48.6941364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6941793Z def test_silu_mul_quant( 2025-05-07T20:31:48.6942035Z self, 2025-05-07T20:31:48.6942228Z T: int, 2025-05-07T20:31:48.6942418Z D: int, 2025-05-07T20:31:48.6942635Z scale_ub: Optional[float], 2025-05-07T20:31:48.6942905Z contiguous: bool, 2025-05-07T20:31:48.6943167Z compiled: bool, 2025-05-07T20:31:48.6943406Z ) -> None: 2025-05-07T20:31:48.6943622Z torch.manual_seed(2025) 2025-05-07T20:31:48.6943857Z 2025-05-07T20:31:48.6944121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6946171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
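[editor's note] For orientation, the op under test comes from fbgemm_gpu.experimental.gen_ai.moe.activation and is called as silu_mul_quant(x0, x1, scale_ub_tensor), returning (y_fp8, y_scale). A rough eager-mode sketch of what such a fused op plausibly computes, assuming silu(x0) * x1 followed by rowwise FP8 E4M3 quantization; this is an illustrative assumption, not the kernel's actual definition:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # Assumed semantics: fused silu-mul, then rowwise fp8 scaling.
        # scale_ub, if given, is a 1-element float32 tensor as in the test.
        y = F.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        return (y / scale).to(torch.float8_e4m3fn), scale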
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6948049Z 2025-05-07T20:31:48.6948165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6948372Z 2025-05-07T20:31:48.6948480Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6948892Z self=, 2025-05-07T20:31:48.6949288Z T=16384, 2025-05-07T20:31:48.6949475Z D=7168, 2025-05-07T20:31:48.6949659Z scale_ub=1200.0, 2025-05-07T20:31:48.6949974Z contiguous=True, 2025-05-07T20:31:48.6950193Z compiled=False, 2025-05-07T20:31:48.6950394Z ) 2025-05-07T20:31:48.6950703Z self = 2025-05-07T20:31:48.6951189Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:48.6951461Z 2025-05-07T20:31:48.6951543Z @given( 2025-05-07T20:31:48.6951764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:48.6952074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:48.6952375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:48.6952693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:48.6953022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:48.6953335Z ) 2025-05-07T20:31:48.6953696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:48.6954130Z def test_silu_mul_quant( 2025-05-07T20:31:48.6954368Z self, 2025-05-07T20:31:48.6954556Z T: int, 2025-05-07T20:31:48.6954749Z D: int, 2025-05-07T20:31:48.6954965Z scale_ub: Optional[float], 2025-05-07T20:31:48.6955227Z contiguous: bool, 2025-05-07T20:31:48.6955465Z compiled: bool, 2025-05-07T20:31:48.6955684Z ) -> None: 2025-05-07T20:31:48.6955901Z torch.manual_seed(2025) 2025-05-07T20:31:48.6956136Z 2025-05-07T20:31:48.6956398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:48.6958542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:48.6960492Z 2025-05-07T20:31:48.6960612Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:48.6960822Z 2025-05-07T20:31:48.6960935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:48.6961345Z self=, 2025-05-07T20:31:48.6961740Z T=128, 2025-05-07T20:31:48.6961927Z D=5120, 2025-05-07T20:31:48.6962111Z scale_ub=1200.0, 2025-05-07T20:31:48.6962334Z contiguous=False, 2025-05-07T20:31:48.6962557Z compiled=False, 2025-05-07T20:31:48.6962759Z ) 2025-05-07T20:31:49.0713394Z self = 2025-05-07T20:31:49.0713953Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.0714234Z 2025-05-07T20:31:49.0714314Z @given( 2025-05-07T20:31:49.0714547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0714858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0715164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0715495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0715825Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0716102Z ) 2025-05-07T20:31:49.0716451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0716891Z def test_silu_mul_quant( 2025-05-07T20:31:49.0717130Z self, 2025-05-07T20:31:49.0717324Z T: int, 2025-05-07T20:31:49.0717522Z D: int, 2025-05-07T20:31:49.0717741Z scale_ub: Optional[float], 2025-05-07T20:31:49.0718009Z contiguous: bool, 2025-05-07T20:31:49.0718255Z compiled: bool, 2025-05-07T20:31:49.0718482Z ) -> None: 2025-05-07T20:31:49.0718695Z torch.manual_seed(2025) 2025-05-07T20:31:49.0718936Z 2025-05-07T20:31:49.0719208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0719546Z 2025-05-07T20:31:49.0719744Z x_sign = torch.sign(x) 2025-05-07T20:31:49.0720033Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.0720340Z x = x_sign * x_clamp 2025-05-07T20:31:49.0720580Z x0 = x[:, :D] 2025-05-07T20:31:49.0720798Z x1 = x[:, D:] 2025-05-07T20:31:49.0721000Z 2025-05-07T20:31:49.0721186Z if contiguous: 2025-05-07T20:31:49.0721418Z x0 = x0.contiguous() 2025-05-07T20:31:49.0721673Z x1 = x1.contiguous() 2025-05-07T20:31:49.0721912Z 2025-05-07T20:31:49.0722106Z if scale_ub is not None: 2025-05-07T20:31:49.0722373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.0722715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.0723027Z ) 2025-05-07T20:31:49.0723222Z else: 2025-05-07T20:31:49.0723427Z scale_ub_tensor = None 2025-05-07T20:31:49.0723681Z 2025-05-07T20:31:49.0723914Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.0724226Z op = silu_mul_quant 2025-05-07T20:31:49.0724476Z if compiled: 2025-05-07T20:31:49.0724725Z op = torch.compile(op) 2025-05-07T20:31:49.0725016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.0725296Z 2025-05-07T20:31:49.0725488Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.0725652Z 2025-05-07T20:31:49.0725753Z moe/activation_test.py:117: 2025-05-07T20:31:49.0726054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.0726385Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.0726669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.0727509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.0728218Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.0728755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.0729578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.0730240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.0730771Z kernel = self.compile( 2025-05-07T20:31:49.0731309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.0731958Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.0732356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.0732588Z 2025-05-07T20:31:49.0732798Z self = 2025-05-07T20:31:49.0733933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.0735315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e7d670>} 2025-05-07T20:31:49.0736659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.0737684Z context = 2025-05-07T20:31:49.0737969Z 2025-05-07T20:31:49.0738143Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.0738661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.0739126Z module_map=module_map) 2025-05-07T20:31:49.0739488Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.0739849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.0740104Z E ^ 2025-05-07T20:31:49.0740571Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.0741021Z 2025-05-07T20:31:49.0741438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.0741947Z 2025-05-07T20:31:49.0742050Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0742462Z self=, 2025-05-07T20:31:49.0742861Z T=2048, 2025-05-07T20:31:49.0743052Z D=7168, 2025-05-07T20:31:49.0743251Z scale_ub=None, 2025-05-07T20:31:49.0743507Z contiguous=False, 2025-05-07T20:31:49.0743732Z compiled=False, 2025-05-07T20:31:49.0743930Z ) 2025-05-07T20:31:49.0744244Z self = 2025-05-07T20:31:49.0744745Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.0745016Z 2025-05-07T20:31:49.0745093Z @given( 2025-05-07T20:31:49.0745322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.0745634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.0745936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.0746267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.0746597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.0746895Z ) 2025-05-07T20:31:49.0747320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.0747759Z def test_silu_mul_quant( 2025-05-07T20:31:49.0748000Z self, 2025-05-07T20:31:49.0748189Z T: int, 2025-05-07T20:31:49.0748391Z D: int, 2025-05-07T20:31:49.0748604Z scale_ub: Optional[float], 2025-05-07T20:31:49.0748879Z contiguous: bool, 2025-05-07T20:31:49.0749195Z compiled: bool, 2025-05-07T20:31:49.0749412Z ) -> None: 2025-05-07T20:31:49.0749626Z torch.manual_seed(2025) 2025-05-07T20:31:49.0749953Z 2025-05-07T20:31:49.0750220Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.0752287Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
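[editor's note] The next failing example below is the first to reach the kernel with compiled=True; its traceback differs only by an extra torch/_dynamo/eval_frame.py frame. torch.compile changes how the Python wrapper is invoked, but as the traceback shows, the wrapper still launches the same Triton kernel, so the fp8e4nv CompilationError is identical on both paths. A minimal sketch of the dispatch the test performs:

    import torch

    def run(op, *args, compiled: bool):
        if compiled:
            op = torch.compile(op)  # adds the eval_frame hop seen below
        return op(*args)            # same Triton kernel launch either way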
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.0754193Z 2025-05-07T20:31:49.0754312Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.0754536Z 2025-05-07T20:31:49.0754638Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.0755055Z self=, 2025-05-07T20:31:49.0755455Z T=128, 2025-05-07T20:31:49.0755644Z D=7168, 2025-05-07T20:31:49.0755838Z scale_ub=1200.0, 2025-05-07T20:31:49.0756063Z contiguous=True, 2025-05-07T20:31:49.0756286Z compiled=True, 2025-05-07T20:31:49.0756488Z ) 2025-05-07T20:31:49.1213318Z self = 2025-05-07T20:31:49.1213839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.1214115Z 2025-05-07T20:31:49.1214198Z @given( 2025-05-07T20:31:49.1214433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1214741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1215046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1215374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1215705Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1215993Z ) 2025-05-07T20:31:49.1216337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1216775Z def test_silu_mul_quant( 2025-05-07T20:31:49.1217013Z self, 2025-05-07T20:31:49.1217208Z T: int, 2025-05-07T20:31:49.1217410Z D: int, 2025-05-07T20:31:49.1217622Z scale_ub: Optional[float], 2025-05-07T20:31:49.1217894Z contiguous: bool, 2025-05-07T20:31:49.1218133Z compiled: bool, 2025-05-07T20:31:49.1218350Z ) -> None: 2025-05-07T20:31:49.1218565Z torch.manual_seed(2025) 2025-05-07T20:31:49.1218812Z 2025-05-07T20:31:49.1219076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1219414Z 2025-05-07T20:31:49.1219605Z x_sign = torch.sign(x) 2025-05-07T20:31:49.1219889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.1220207Z x = x_sign * x_clamp 2025-05-07T20:31:49.1220447Z x0 = x[:, :D] 2025-05-07T20:31:49.1220658Z x1 = x[:, D:] 2025-05-07T20:31:49.1220865Z 2025-05-07T20:31:49.1221051Z if contiguous: 2025-05-07T20:31:49.1221278Z x0 = x0.contiguous() 2025-05-07T20:31:49.1221529Z x1 = x1.contiguous() 2025-05-07T20:31:49.1221770Z 2025-05-07T20:31:49.1221963Z if scale_ub is not None: 2025-05-07T20:31:49.1222228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.1222561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.1222870Z ) 2025-05-07T20:31:49.1223200Z else: 2025-05-07T20:31:49.1223413Z scale_ub_tensor = None 2025-05-07T20:31:49.1223662Z 2025-05-07T20:31:49.1223890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.1224203Z op = silu_mul_quant 2025-05-07T20:31:49.1224450Z if compiled: 2025-05-07T20:31:49.1224836Z op = torch.compile(op) 2025-05-07T20:31:49.1225126Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.1225396Z 2025-05-07T20:31:49.1225587Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.1225751Z 2025-05-07T20:31:49.1225851Z moe/activation_test.py:117: 2025-05-07T20:31:49.1226148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.1226477Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.1226756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.1227306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.1227866Z return fn(*args, **kwargs) 2025-05-07T20:31:49.1228522Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.1229206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.1229741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.1230493Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.1231148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.1231673Z kernel = self.compile( 2025-05-07T20:31:49.1232205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.1232855Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.1233299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.1233532Z 2025-05-07T20:31:49.1233741Z self = 2025-05-07T20:31:49.1234824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.1236205Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e665e0>} 2025-05-07T20:31:49.1237551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.1238564Z context = 2025-05-07T20:31:49.1238859Z 2025-05-07T20:31:49.1239026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.1239548Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.1240013Z module_map=module_map) 2025-05-07T20:31:49.1240376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.1240727Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.1240985Z E ^ 2025-05-07T20:31:49.1241441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.1241894Z 2025-05-07T20:31:49.1242307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.1242819Z 2025-05-07T20:31:49.1242920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1243420Z self=, 2025-05-07T20:31:49.1243818Z T=128, 2025-05-07T20:31:49.1244006Z D=7168, 2025-05-07T20:31:49.1244201Z scale_ub=1200.0, 2025-05-07T20:31:49.1244426Z contiguous=True, 2025-05-07T20:31:49.1244646Z compiled=False, 2025-05-07T20:31:49.1244850Z ) 2025-05-07T20:31:49.1245162Z self = 2025-05-07T20:31:49.1245731Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.1246004Z 2025-05-07T20:31:49.1246080Z @given( 2025-05-07T20:31:49.1246309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1246613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1246918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1247247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1247571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1247857Z ) 2025-05-07T20:31:49.1248214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1248649Z def test_silu_mul_quant( 2025-05-07T20:31:49.1248890Z self, 2025-05-07T20:31:49.1249081Z T: int, 2025-05-07T20:31:49.1249272Z D: int, 2025-05-07T20:31:49.1249491Z scale_ub: Optional[float], 2025-05-07T20:31:49.1249771Z contiguous: bool, 2025-05-07T20:31:49.1250007Z compiled: bool, 2025-05-07T20:31:49.1250225Z ) -> None: 2025-05-07T20:31:49.1250441Z torch.manual_seed(2025) 2025-05-07T20:31:49.1250683Z 2025-05-07T20:31:49.1250945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1251287Z 2025-05-07T20:31:49.1251480Z x_sign = torch.sign(x) 2025-05-07T20:31:49.1251769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.1253824Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
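Note: the allocator hint printed in the OutOfMemoryError above can be applied before CUDA is initialized. A minimal sketch of doing that from a Python entry point (the placement in a launcher script is an assumption, not something this log prescribes):

    import os
    # Must be set before torch initializes the CUDA context to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the caching allocator picks it up

Equivalently, the variable can be exported in the shell before invoking pytest.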
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.1255684Z 2025-05-07T20:31:49.1255803Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.1256026Z 2025-05-07T20:31:49.1256127Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1256538Z self=, 2025-05-07T20:31:49.1256942Z T=128, 2025-05-07T20:31:49.1257124Z D=5120, 2025-05-07T20:31:49.1257316Z scale_ub=1200.0, 2025-05-07T20:31:49.1257540Z contiguous=True, 2025-05-07T20:31:49.1257760Z compiled=True, 2025-05-07T20:31:49.1264609Z ) 2025-05-07T20:31:49.1264957Z self = 2025-05-07T20:31:49.1265448Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.1265719Z 2025-05-07T20:31:49.1265800Z @given( 2025-05-07T20:31:49.1266030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.1266341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.1266644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.1266974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.1267299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.1267585Z ) 2025-05-07T20:31:49.1267929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.1268364Z def test_silu_mul_quant( 2025-05-07T20:31:49.1268605Z self, 2025-05-07T20:31:49.1268801Z T: int, 2025-05-07T20:31:49.1268989Z D: int, 2025-05-07T20:31:49.1269317Z scale_ub: Optional[float], 2025-05-07T20:31:49.1269592Z contiguous: bool, 2025-05-07T20:31:49.1269872Z compiled: bool, 2025-05-07T20:31:49.1270112Z ) -> None: 2025-05-07T20:31:49.1270337Z torch.manual_seed(2025) 2025-05-07T20:31:49.1270597Z 2025-05-07T20:31:49.1270884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.1271355Z 2025-05-07T20:31:49.1271557Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.1274042Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.1276437Z 2025-05-07T20:31:49.1276556Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.1276774Z 2025-05-07T20:31:49.1276877Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.1277286Z self=, 2025-05-07T20:31:49.1277693Z T=128, 2025-05-07T20:31:49.1277871Z D=7168, 2025-05-07T20:31:49.1278060Z scale_ub=None, 2025-05-07T20:31:49.1278274Z contiguous=True, 2025-05-07T20:31:49.1278490Z compiled=True, 2025-05-07T20:31:49.1278688Z ) 2025-05-07T20:31:49.4113787Z self = 2025-05-07T20:31:49.4114299Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4114559Z 2025-05-07T20:31:49.4114636Z @given( 2025-05-07T20:31:49.4114862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4115181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4115479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4115802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4116121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4116396Z ) 2025-05-07T20:31:49.4116747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4117181Z def test_silu_mul_quant( 2025-05-07T20:31:49.4117408Z self, 2025-05-07T20:31:49.4117595Z T: int, 2025-05-07T20:31:49.4117791Z D: int, 2025-05-07T20:31:49.4117998Z scale_ub: Optional[float], 2025-05-07T20:31:49.4118258Z contiguous: bool, 2025-05-07T20:31:49.4118489Z compiled: bool, 2025-05-07T20:31:49.4118710Z ) -> None: 2025-05-07T20:31:49.4118914Z torch.manual_seed(2025) 2025-05-07T20:31:49.4119147Z 2025-05-07T20:31:49.4119404Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4121458Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4123310Z 2025-05-07T20:31:49.4123425Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.4123640Z 2025-05-07T20:31:49.4185252Z FAILED 2025-05-07T20:31:49.4185555Z 2025-05-07T20:31:49.4185795Z =================================== FAILURES =================================== 2025-05-07T20:31:49.4186423Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:49.4187257Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:49.4188103Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:31:49.4188842Z | yield 2025-05-07T20:31:49.4189414Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:31:49.4190333Z | self._callTestMethod(testMethod) 2025-05-07T20:31:49.4191094Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:31:49.4191819Z | method() 2025-05-07T20:31:49.4192670Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:49.4194011Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4194927Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:49.4195772Z | raise the_error_hypothesis_found 2025-05-07T20:31:49.4196441Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:49.4197100Z +-+---------------- 1 ---------------- 2025-05-07T20:31:49.4197502Z | Traceback (most recent call last): 2025-05-07T20:31:49.4198473Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4199522Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4202334Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
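Note: Hypothesis runs all of its examples inside a single test invocation, so allocations from earlier examples (here up to T=16384 x 2*7168 bf16, hundreds of MiB per tensor) can still occupy the pool when later examples start, which is consistent with the 20 MiB allocations failing above. A hedged sketch of per-example cleanup; because setUp/tearDown only wrap the whole Hypothesis run, the helper would have to be called at the top of the test body itself:

    import gc
    import torch

    def _reset_cuda_pool() -> None:
        # Call first thing inside test_silu_mul_quant so it runs once per
        # generated example, not once per unittest method.
        gc.collect()  # drop Python references left over from the last example
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver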
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4205055Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4205501Z | self=, 2025-05-07T20:31:49.4205897Z | T=128, 2025-05-07T20:31:49.4206092Z | D=7168, 2025-05-07T20:31:49.4206298Z | scale_ub=1200.0, 2025-05-07T20:31:49.4206529Z | contiguous=True, 2025-05-07T20:31:49.4206766Z | compiled=False, 2025-05-07T20:31:49.4206989Z | ) 2025-05-07T20:31:49.4207159Z | 2025-05-07T20:31:49.4207678Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:31:49.4208287Z +---------------- 2 ---------------- 2025-05-07T20:31:49.4208570Z | Traceback (most recent call last): 2025-05-07T20:31:49.4209268Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4210042Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4212099Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4214266Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4214703Z | self=, 2025-05-07T20:31:49.4215096Z | T=128, 2025-05-07T20:31:49.4215293Z | D=7168, 2025-05-07T20:31:49.4215499Z | scale_ub=None, 2025-05-07T20:31:49.4215723Z | contiguous=True, 2025-05-07T20:31:49.4216077Z | compiled=True, 2025-05-07T20:31:49.4216293Z | ) 2025-05-07T20:31:49.4216459Z | 2025-05-07T20:31:49.4216977Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4217570Z +---------------- 3 ---------------- 2025-05-07T20:31:49.4217844Z | Traceback (most recent call last): 2025-05-07T20:31:49.4218545Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:49.4219318Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4221356Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.4223789Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4224399Z | self=, 2025-05-07T20:31:49.4224969Z | T=128, 2025-05-07T20:31:49.4225247Z | D=5120, 2025-05-07T20:31:49.4225536Z | scale_ub=1200.0, 2025-05-07T20:31:49.4225861Z | contiguous=True, 2025-05-07T20:31:49.4226176Z | compiled=True, 2025-05-07T20:31:49.4226495Z | ) 2025-05-07T20:31:49.4226722Z | 2025-05-07T20:31:49.4227441Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4228266Z +---------------- 4 ---------------- 2025-05-07T20:31:49.4228654Z | Traceback (most recent call last): 2025-05-07T20:31:49.4229621Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:31:49.4230704Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4231596Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:31:49.4232527Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4233726Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:31:49.4234814Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4235627Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:31:49.4236612Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4237663Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:31:49.4238738Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4239837Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:31:49.4241031Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4242079Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:31:49.4242772Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4243536Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:31:49.4244080Z | fn() 2025-05-07T20:31:49.4244640Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:31:49.4245262Z | self.fn.run( 2025-05-07T20:31:49.4245776Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:31:49.4246346Z | kernel = self.compile( 2025-05-07T20:31:49.4246948Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:31:49.4247642Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4248332Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.4249112Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4249617Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4249964Z | def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4250211Z | ^ 2025-05-07T20:31:49.4250661Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4251212Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:31:49.4251601Z | # The test always failed when commented parts were varied together. 2025-05-07T20:31:49.4252110Z | self=, 2025-05-07T20:31:49.4252546Z | T=1, # or any other generated value 2025-05-07T20:31:49.4252847Z | D=5120, # or any other generated value 2025-05-07T20:31:49.4253206Z | scale_ub=None, # or any other generated value 2025-05-07T20:31:49.4253577Z | contiguous=True, # or any other generated value 2025-05-07T20:31:49.4254035Z | compiled=True, # or any other generated value 2025-05-07T20:31:49.4254438Z | ) 2025-05-07T20:31:49.4254682Z | 2025-05-07T20:31:49.4255395Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:31:49.4256220Z +------------------------------------ 2025-05-07T20:31:49.4256720Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:31:49.4257228Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4257801Z self=, 2025-05-07T20:31:49.4258344Z T=1, 2025-05-07T20:31:49.4258597Z D=5120, 2025-05-07T20:31:49.4258864Z scale_ub=None, 2025-05-07T20:31:49.4259157Z contiguous=True, 2025-05-07T20:31:49.4304531Z compiled=True, 2025-05-07T20:31:49.4304885Z ) 2025-05-07T20:31:49.4305331Z self = 2025-05-07T20:31:49.4306012Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4306368Z 2025-05-07T20:31:49.4306488Z @given( 2025-05-07T20:31:49.4306804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4307235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4307651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4308090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4308843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4309244Z ) 2025-05-07T20:31:49.4309711Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4310448Z def test_silu_mul_quant( 2025-05-07T20:31:49.4310790Z self, 2025-05-07T20:31:49.4311058Z T: int, 2025-05-07T20:31:49.4311493Z D: int, 2025-05-07T20:31:49.4311796Z scale_ub: Optional[float], 2025-05-07T20:31:49.4312165Z contiguous: bool, 2025-05-07T20:31:49.4312494Z compiled: bool, 2025-05-07T20:31:49.4312807Z ) -> None: 2025-05-07T20:31:49.4313136Z torch.manual_seed(2025) 2025-05-07T20:31:49.4313485Z 2025-05-07T20:31:49.4313850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4314309Z 2025-05-07T20:31:49.4314564Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4314951Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4315372Z x = x_sign * x_clamp 2025-05-07T20:31:49.4315702Z x0 = x[:, :D] 2025-05-07T20:31:49.4315987Z x1 = x[:, D:] 2025-05-07T20:31:49.4316259Z 2025-05-07T20:31:49.4316511Z if contiguous: 2025-05-07T20:31:49.4316828Z x0 = x0.contiguous() 
2025-05-07T20:31:49.4317197Z x1 = x1.contiguous() 2025-05-07T20:31:49.4317530Z 2025-05-07T20:31:49.4317765Z if scale_ub is not None: 2025-05-07T20:31:49.4318104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4318508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4318885Z ) 2025-05-07T20:31:49.4319150Z else: 2025-05-07T20:31:49.4319441Z scale_ub_tensor = None 2025-05-07T20:31:49.4319779Z 2025-05-07T20:31:49.4320066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4320456Z op = silu_mul_quant 2025-05-07T20:31:49.4320756Z if compiled: 2025-05-07T20:31:49.4321055Z op = torch.compile(op) 2025-05-07T20:31:49.4321430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4321776Z 2025-05-07T20:31:49.4322041Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4322409Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4322760Z 2025-05-07T20:31:49.4323047Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4323544Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4323895Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4324290Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4324743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4325151Z 2025-05-07T20:31:49.4325392Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4325636Z 2025-05-07T20:31:49.4325757Z moe/activation_test.py:126: 2025-05-07T20:31:49.4326120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4326527Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4326927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4327902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4328850Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4329514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4330419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4331366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4332350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4333401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4334322Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4335219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4336007Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4336830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4337467Z fn() 2025-05-07T20:31:49.4338086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4338799Z self.fn.run( 2025-05-07T20:31:49.4339372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4340026Z kernel = self.compile( 2025-05-07T20:31:49.4340689Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4341495Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4341997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4342279Z 2025-05-07T20:31:49.4342536Z self = 2025-05-07T20:31:49.4343947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4345721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba5c74820>} 2025-05-07T20:31:49.4347462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4348752Z context = 2025-05-07T20:31:49.4349133Z 2025-05-07T20:31:49.4349334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4350144Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4350782Z module_map=module_map) 2025-05-07T20:31:49.4351268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4351736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4352098Z E ^ 2025-05-07T20:31:49.4352726Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4353361Z 2025-05-07T20:31:49.4353947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4354635Z 2025-05-07T20:31:49.4354773Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4355333Z self=, 2025-05-07T20:31:49.4355887Z T=2048, 2025-05-07T20:31:49.4356138Z D=5120, 2025-05-07T20:31:49.4356411Z scale_ub=1200.0, 2025-05-07T20:31:49.4356717Z contiguous=True, 2025-05-07T20:31:49.4357018Z compiled=False, 2025-05-07T20:31:49.4357298Z ) 2025-05-07T20:31:49.4357731Z self = 2025-05-07T20:31:49.4358400Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.4358785Z 2025-05-07T20:31:49.4358892Z @given( 2025-05-07T20:31:49.4359218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4359647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4360064Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4360710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4361170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4361552Z ) 2025-05-07T20:31:49.4362031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4362639Z def test_silu_mul_quant( 2025-05-07T20:31:49.4363064Z self, 2025-05-07T20:31:49.4363331Z T: int, 2025-05-07T20:31:49.4363605Z D: int, 2025-05-07T20:31:49.4363899Z scale_ub: Optional[float], 2025-05-07T20:31:49.4364270Z contiguous: bool, 2025-05-07T20:31:49.4364595Z compiled: bool, 2025-05-07T20:31:49.4364902Z ) -> None: 2025-05-07T20:31:49.4365194Z torch.manual_seed(2025) 2025-05-07T20:31:49.4365531Z 2025-05-07T20:31:49.4365904Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4366362Z 2025-05-07T20:31:49.4366629Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4367024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4367422Z x = x_sign * x_clamp 2025-05-07T20:31:49.4367748Z x0 = x[:, :D] 
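Note: every compiled path in this run fails with the same ValueError because the g5 runner's A10G (sm_86) predates native fp8e4nv (e4m3) support in Triton. A capability gate is one way to skip rather than error on such GPUs; a sketch, where the (8, 9) threshold is an assumption about Triton's requirement (Ada/Hopper), not something stated in this log:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # fp8e4nv codegen generally needs sm_89+; the A10G above is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the fp8 tests so unsupported GPUs skip instead of failing.
    fp8_only = unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv unsupported on this GPU")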
2025-05-07T20:31:49.4368044Z x1 = x[:, D:] 2025-05-07T20:31:49.4368324Z 2025-05-07T20:31:49.4368579Z if contiguous: 2025-05-07T20:31:49.4368903Z x0 = x0.contiguous() 2025-05-07T20:31:49.4369261Z x1 = x1.contiguous() 2025-05-07T20:31:49.4369593Z 2025-05-07T20:31:49.4369858Z if scale_ub is not None: 2025-05-07T20:31:49.4370232Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4370712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4371136Z ) 2025-05-07T20:31:49.4371402Z else: 2025-05-07T20:31:49.4371692Z scale_ub_tensor = None 2025-05-07T20:31:49.4372032Z 2025-05-07T20:31:49.4372347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4393463Z op = silu_mul_quant 2025-05-07T20:31:49.4393900Z if compiled: 2025-05-07T20:31:49.4394260Z op = torch.compile(op) 2025-05-07T20:31:49.4394680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4395067Z 2025-05-07T20:31:49.4395327Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4395561Z 2025-05-07T20:31:49.4395706Z moe/activation_test.py:117: 2025-05-07T20:31:49.4396122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4396576Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4396970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4397930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4398856Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4399544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4400425Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4401305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4402011Z kernel = self.compile( 2025-05-07T20:31:49.4402694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4403570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4404396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4404694Z 2025-05-07T20:31:49.4404951Z self = 2025-05-07T20:31:49.4406345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4408524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba62e1ee0>} 2025-05-07T20:31:49.4410273Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4411808Z context = 2025-05-07T20:31:49.4412232Z 2025-05-07T20:31:49.4412467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4413232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4413938Z module_map=module_map) 2025-05-07T20:31:49.4414448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4414922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4415300Z E ^ 2025-05-07T20:31:49.4415948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4416595Z 2025-05-07T20:31:49.4417178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4417921Z 2025-05-07T20:31:49.4418064Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4418639Z self=, 2025-05-07T20:31:49.4419188Z T=2048, 2025-05-07T20:31:49.4419454Z D=5120, 2025-05-07T20:31:49.4419726Z scale_ub=1200.0, 2025-05-07T20:31:49.4420030Z contiguous=True, 2025-05-07T20:31:49.4420335Z compiled=True, 2025-05-07T20:31:49.4420620Z ) 2025-05-07T20:31:49.4421049Z self = 2025-05-07T20:31:49.4421729Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.4422103Z 2025-05-07T20:31:49.4422209Z @given( 2025-05-07T20:31:49.4422523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4422941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4423358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4423816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4424243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4424625Z ) 2025-05-07T20:31:49.4425092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4425678Z def test_silu_mul_quant( 2025-05-07T20:31:49.4426001Z self, 2025-05-07T20:31:49.4426262Z T: int, 2025-05-07T20:31:49.4426530Z D: int, 2025-05-07T20:31:49.4426815Z scale_ub: Optional[float], 2025-05-07T20:31:49.4427184Z contiguous: bool, 2025-05-07T20:31:49.4427509Z compiled: bool, 2025-05-07T20:31:49.4427811Z ) -> None: 2025-05-07T20:31:49.4428108Z torch.manual_seed(2025) 2025-05-07T20:31:49.4428470Z 2025-05-07T20:31:49.4428855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4429361Z 2025-05-07T20:31:49.4429633Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4430165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4430598Z x = x_sign * x_clamp 2025-05-07T20:31:49.4430940Z x0 = x[:, :D] 2025-05-07T20:31:49.4431246Z x1 = x[:, D:] 2025-05-07T20:31:49.4431533Z 2025-05-07T20:31:49.4431794Z if contiguous: 2025-05-07T20:31:49.4432113Z x0 = x0.contiguous() 2025-05-07T20:31:49.4432462Z x1 = x1.contiguous() 2025-05-07T20:31:49.4432798Z 2025-05-07T20:31:49.4433094Z if scale_ub is not None: 2025-05-07T20:31:49.4433489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4433949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4434519Z ) 2025-05-07T20:31:49.4434787Z else: 2025-05-07T20:31:49.4435086Z scale_ub_tensor = None 2025-05-07T20:31:49.4435436Z 2025-05-07T20:31:49.4435739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4436170Z op = silu_mul_quant 2025-05-07T20:31:49.4436606Z if compiled: 2025-05-07T20:31:49.4436946Z op = torch.compile(op) 2025-05-07T20:31:49.4437351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4437718Z 2025-05-07T20:31:49.4437980Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4438376Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4438781Z 2025-05-07T20:31:49.4439114Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4439566Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4439975Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4440425Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4440923Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4441360Z 2025-05-07T20:31:49.4441642Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.4441926Z 2025-05-07T20:31:49.4442077Z moe/activation_test.py:126: 2025-05-07T20:31:49.4442508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4442986Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4443505Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4444610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4445676Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4446438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4447400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4448349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4449355Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4450414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4451457Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4452480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4453357Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4454212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4454935Z fn() 2025-05-07T20:31:49.4455647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4456389Z self.fn.run( 2025-05-07T20:31:49.4457004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4457722Z kernel = self.compile( 2025-05-07T20:31:49.4458476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4459383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4459940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4460267Z 2025-05-07T20:31:49.4460551Z self = 2025-05-07T20:31:49.4462152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4464084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba62dd5e0>} 2025-05-07T20:31:49.4465861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4467136Z context = 2025-05-07T20:31:49.4467515Z 2025-05-07T20:31:49.4467729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4468428Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4469035Z module_map=module_map) 2025-05-07T20:31:49.4469540Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4470095Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4470426Z E ^ 2025-05-07T20:31:49.4471029Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4471640Z 2025-05-07T20:31:49.4472209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4472911Z 2025-05-07T20:31:49.4473054Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4473615Z self=, 2025-05-07T20:31:49.4474156Z T=16384, 2025-05-07T20:31:49.4474425Z D=7168, 2025-05-07T20:31:49.4474692Z scale_ub=1200.0, 2025-05-07T20:31:49.4474996Z contiguous=False, 2025-05-07T20:31:49.4475306Z compiled=False, 2025-05-07T20:31:49.4475590Z ) 2025-05-07T20:31:49.4476021Z self = 2025-05-07T20:31:49.4476698Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4477083Z 2025-05-07T20:31:49.4477203Z @given( 2025-05-07T20:31:49.4477510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4477916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4478358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4478789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4479210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4479599Z ) 2025-05-07T20:31:49.4480034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4480573Z def test_silu_mul_quant( 2025-05-07T20:31:49.4480871Z self, 2025-05-07T20:31:49.4481114Z T: int, 2025-05-07T20:31:49.4481351Z D: int, 2025-05-07T20:31:49.4481624Z scale_ub: Optional[float], 2025-05-07T20:31:49.4481995Z contiguous: bool, 2025-05-07T20:31:49.4482329Z compiled: bool, 2025-05-07T20:31:49.4482649Z ) -> None: 2025-05-07T20:31:49.4482955Z torch.manual_seed(2025) 2025-05-07T20:31:49.4483293Z 2025-05-07T20:31:49.4483680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4484177Z 2025-05-07T20:31:49.4484456Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4484852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4485284Z x = x_sign * x_clamp 2025-05-07T20:31:49.4485625Z x0 = x[:, :D] 2025-05-07T20:31:49.4485918Z x1 = x[:, D:] 2025-05-07T20:31:49.4486203Z 2025-05-07T20:31:49.4486464Z if contiguous: 2025-05-07T20:31:49.4486766Z x0 = x0.contiguous() 2025-05-07T20:31:49.4487087Z x1 = x1.contiguous() 2025-05-07T20:31:49.4487391Z 2025-05-07T20:31:49.4487623Z if scale_ub is not None: 2025-05-07T20:31:49.4488069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4488487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4488861Z ) 2025-05-07T20:31:49.4489133Z else: 2025-05-07T20:31:49.4489425Z scale_ub_tensor = None 2025-05-07T20:31:49.4489760Z 2025-05-07T20:31:49.4490067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4490613Z op = silu_mul_quant 2025-05-07T20:31:49.4490941Z if compiled: 
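Note: before quantization, the math exercised by the listings in this log reduces to SiLU gating. A sketch of the unfused eager equivalent, using only public torch ops (this mirrors ref_fn above and is not the FBGEMM kernel):

    import torch
    import torch.nn.functional as F

    def silu_mul_eager(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # x0 * sigmoid(x0) * x1, computed in fp32 exactly as ref_fn does.
        return F.silu(x0.float()) * x1.float()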
2025-05-07T20:31:49.4491202Z op = torch.compile(op) 2025-05-07T20:31:49.4491504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4491783Z 2025-05-07T20:31:49.4491979Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4492150Z 2025-05-07T20:31:49.4492259Z moe/activation_test.py:117: 2025-05-07T20:31:49.4492558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4492885Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4493176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4493873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4494578Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4495116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4495813Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4496480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4497009Z kernel = self.compile( 2025-05-07T20:31:49.4497554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4498212Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4498617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4498848Z 2025-05-07T20:31:49.4499062Z self = 2025-05-07T20:31:49.4500155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4501555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab6864e160>} 2025-05-07T20:31:49.4502914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4504310Z context = 2025-05-07T20:31:49.4504610Z 2025-05-07T20:31:49.4504783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4505317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4505784Z module_map=module_map) 2025-05-07T20:31:49.4506150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4506509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4506768Z E ^ 2025-05-07T20:31:49.4507232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4507692Z 2025-05-07T20:31:49.4508112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4508630Z 2025-05-07T20:31:49.4508734Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4509148Z self=, 2025-05-07T20:31:49.4509762Z T=1, 2025-05-07T20:31:49.4510040Z D=7168, 2025-05-07T20:31:49.4510228Z scale_ub=None, 2025-05-07T20:31:49.4510435Z contiguous=True, 2025-05-07T20:31:49.4510654Z compiled=True, 2025-05-07T20:31:49.4510850Z ) 2025-05-07T20:31:49.4511163Z self = 2025-05-07T20:31:49.4511773Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.4512033Z 2025-05-07T20:31:49.4512109Z @given( 2025-05-07T20:31:49.4512334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4512638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4512948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4513293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4513650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4513937Z ) 2025-05-07T20:31:49.4514291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4514729Z def test_silu_mul_quant( 2025-05-07T20:31:49.4514964Z self, 2025-05-07T20:31:49.4515156Z T: int, 2025-05-07T20:31:49.4515353Z D: int, 2025-05-07T20:31:49.4515564Z scale_ub: Optional[float], 2025-05-07T20:31:49.4515844Z contiguous: bool, 2025-05-07T20:31:49.4516076Z compiled: bool, 2025-05-07T20:31:49.4516292Z ) -> None: 2025-05-07T20:31:49.4516511Z torch.manual_seed(2025) 2025-05-07T20:31:49.4516751Z 2025-05-07T20:31:49.4517014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4517355Z 2025-05-07T20:31:49.4517547Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4517829Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4518135Z x = x_sign * x_clamp 2025-05-07T20:31:49.4518376Z x0 = x[:, :D] 2025-05-07T20:31:49.4518589Z x1 = x[:, D:] 2025-05-07T20:31:49.4518794Z 2025-05-07T20:31:49.4518982Z if contiguous: 2025-05-07T20:31:49.4519204Z x0 = x0.contiguous() 2025-05-07T20:31:49.4519464Z x1 = x1.contiguous() 2025-05-07T20:31:49.4519708Z 2025-05-07T20:31:49.4519898Z if scale_ub is not None: 2025-05-07T20:31:49.4520165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4520511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4520814Z ) 2025-05-07T20:31:49.4521002Z else: 2025-05-07T20:31:49.4521213Z scale_ub_tensor = None 2025-05-07T20:31:49.4521464Z 2025-05-07T20:31:49.4521690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4522005Z op = silu_mul_quant 2025-05-07T20:31:49.4522255Z if compiled: 2025-05-07T20:31:49.4522497Z op = torch.compile(op) 2025-05-07T20:31:49.4522792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4523070Z 2025-05-07T20:31:49.4523262Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.4523565Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.4523856Z 2025-05-07T20:31:49.4524092Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4524422Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.4524721Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.4525032Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.4525382Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4525685Z 2025-05-07T20:31:49.4525883Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:49.4526076Z 2025-05-07T20:31:49.4526175Z moe/activation_test.py:126: 2025-05-07T20:31:49.4526470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4526806Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.4527132Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.4528001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.4528769Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.4529319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4530071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4530760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.4531486Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4532238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:49.4533082Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.4533994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.4534790Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.4535502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.4536016Z fn() 2025-05-07T20:31:49.4536515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.4537088Z self.fn.run( 2025-05-07T20:31:49.4537545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4538071Z kernel = self.compile( 2025-05-07T20:31:49.4538608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4539264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4539652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4539883Z 2025-05-07T20:31:49.4540092Z self = 2025-05-07T20:31:49.4541178Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4542576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faba5c20280>} 2025-05-07T20:31:49.4544148Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4545307Z context = 2025-05-07T20:31:49.4545606Z 2025-05-07T20:31:49.4545775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4546304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4546767Z module_map=module_map) 2025-05-07T20:31:49.4547133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4547491Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4547756Z E ^ 2025-05-07T20:31:49.4548219Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4548673Z 2025-05-07T20:31:49.4549088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4549597Z 2025-05-07T20:31:49.4549704Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4550270Z self=, 2025-05-07T20:31:49.4550680Z T=4096, 2025-05-07T20:31:49.4550866Z D=5120, 2025-05-07T20:31:49.4551061Z scale_ub=None, 2025-05-07T20:31:49.4551270Z contiguous=False, 2025-05-07T20:31:49.4551492Z compiled=False, 2025-05-07T20:31:49.4551772Z ) 2025-05-07T20:31:49.4552091Z self = 2025-05-07T20:31:49.4552583Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.4552854Z 2025-05-07T20:31:49.4552933Z @given( 2025-05-07T20:31:49.4553152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4553459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4553761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4554082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4554419Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4554703Z ) 2025-05-07T20:31:49.4555048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4555480Z def test_silu_mul_quant( 2025-05-07T20:31:49.4555718Z self, 2025-05-07T20:31:49.4555908Z T: int, 2025-05-07T20:31:49.4556104Z D: int, 2025-05-07T20:31:49.4556319Z scale_ub: Optional[float], 2025-05-07T20:31:49.4556587Z contiguous: bool, 2025-05-07T20:31:49.4556816Z compiled: bool, 2025-05-07T20:31:49.4557035Z ) -> None: 2025-05-07T20:31:49.4557248Z torch.manual_seed(2025) 2025-05-07T20:31:49.4557479Z 2025-05-07T20:31:49.4557743Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4558080Z 2025-05-07T20:31:49.4558271Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4558558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4558864Z x = x_sign * x_clamp 2025-05-07T20:31:49.4559102Z x0 = x[:, :D] 2025-05-07T20:31:49.4559320Z x1 = x[:, D:] 2025-05-07T20:31:49.4559524Z 2025-05-07T20:31:49.4559709Z if contiguous: 2025-05-07T20:31:49.4559932Z x0 = x0.contiguous() 2025-05-07T20:31:49.4560190Z x1 = x1.contiguous() 2025-05-07T20:31:49.4560430Z 2025-05-07T20:31:49.4560621Z if scale_ub is not None: 2025-05-07T20:31:49.4560900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4561233Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4561535Z ) 2025-05-07T20:31:49.4561731Z else: 2025-05-07T20:31:49.4561942Z scale_ub_tensor = None 2025-05-07T20:31:49.4562186Z 2025-05-07T20:31:49.4562416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4562728Z op = silu_mul_quant 2025-05-07T20:31:49.4562983Z if compiled: 
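Note: triton_quantize_fp8_row, the reference path that fails to compile above, performs rowwise scaling into fp8. A rough pure-PyTorch sketch of that idea; the 448.0 e4m3 maximum and the exact scale_ub handling are assumptions, not FBGEMM's verified semantics:

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
        # One scale per row: map the row's max |value| onto the fp8 e4m3 range.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the rowwise scale
        scale = row_max.float() / 448.0
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)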
2025-05-07T20:31:49.4563266Z                 op = torch.compile(op)
2025-05-07T20:31:49.4563567Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.4563833Z 
2025-05-07T20:31:49.4564031Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:49.4564201Z 
2025-05-07T20:31:49.4564297Z moe/activation_test.py:117: 
2025-05-07T20:31:49.4564589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4564918Z moe/activation_test.py:115: in fn
2025-05-07T20:31:49.4565195Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:49.4565886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:49.4566589Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:49.4567126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:49.4567807Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.4568577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.4569105Z     kernel = self.compile(
2025-05-07T20:31:49.4569647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.4570299Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.4570774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4571006Z 
2025-05-07T20:31:49.4571232Z self = 
2025-05-07T20:31:49.4580729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.4582134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab69195f70>}
2025-05-07T20:31:49.4583482Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.4584517Z context = 
2025-05-07T20:31:49.4584816Z 
2025-05-07T20:31:49.4584989Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.4585521Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.4585985Z                            module_map=module_map)
2025-05-07T20:31:49.4586355Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.4586715Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:49.4586986Z E       ^
2025-05-07T20:31:49.4587461Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.4587919Z 
2025-05-07T20:31:49.4588334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.4588845Z 
2025-05-07T20:31:49.4588957Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
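The root cause of every failure in this run is the same: Triton's fp8e4nv type is the NVIDIA-native float8 e4m3 format (torch.float8_e4m3fn), which the backend only lowers on GPUs with compute capability 8.9 (Ada) or newer, while the linux.g5.4xlarge runner carries an A10G at sm_86, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability gate that would skip these examples on such runners follows; the helper name and decorator placement are hypothetical, not FBGEMM's actual code:

    import unittest

    import torch

    def supports_native_fp8_e4m3() -> bool:
        # fp8e4nv lowers to native e4m3 instructions only on sm_89+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipIf(not supports_native_fp8_e4m3(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...): ...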
2025-05-07T20:31:49.4620370Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[test source up to fn() identical to the example above; in this draw fn() returns and the reference path fails instead:]
2025-05-07T20:31:49.4634605Z         y_fp8, y_scale = fn()
2025-05-07T20:31:49.4634888Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:49.4635180Z 
2025-05-07T20:31:49.4635420Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:49.4635838Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:49.4636130Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:49.4636444Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:49.4636795Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.4637182Z 
2025-05-07T20:31:49.4637385Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:49.4637576Z 
2025-05-07T20:31:49.4637673Z moe/activation_test.py:126: 
2025-05-07T20:31:49.4637967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4638295Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:49.4638621Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:49.4639400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:49.4640158Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:49.4640705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:49.4641379Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:49.4642064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:49.4642786Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.4643534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:49.4644273Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:49.4644997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:49.4645636Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:49.4646237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:49.4646746Z     fn()
2025-05-07T20:31:49.4647243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:49.4647829Z     self.fn.run(
2025-05-07T20:31:49.4648285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:49.4648810Z     kernel = self.compile(
2025-05-07T20:31:49.4649342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:49.4649990Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:49.4650379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:49.4650612Z 
2025-05-07T20:31:49.4650822Z self = 
2025-05-07T20:31:49.4651906Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:49.4653503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faba1a29700>}
2025-05-07T20:31:49.4655172Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:49.4656217Z context = 
2025-05-07T20:31:49.4656512Z 
2025-05-07T20:31:49.4656681Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:49.4657321Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:49.4657785Z                            module_map=module_map)
2025-05-07T20:31:49.4658154Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:49.4658511Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:49.4658854Z E       ^
2025-05-07T20:31:49.4659312Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:49.4659766Z 
2025-05-07T20:31:49.4660178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:49.4660684Z 
2025-05-07T20:31:49.4660794Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
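For readers following the reference path above: triton_quantize_fp8_row computes one scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A rough pure-PyTorch sketch of that contract, inferred from the test rather than taken from FBGEMM (the 448.0 constant is the float8_e4m3fn format maximum; treating scale_ub as a cap on the row maximum is an assumption):

    from typing import Optional, Tuple

    import torch  # float8_e4m3fn requires PyTorch >= 2.1

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequantization scale: row_max / fp8_max, optionally capped,
        # floored to avoid division by zero on all-zero rows.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale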
2025-05-07T20:31:49.4691662Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[identical test source and traceback elided; fails with the same CompilationError in _fbgemm_silu_mul_quant: fp8e4nv not supported]
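The repetition in this log comes from the test's Hypothesis settings: verbosity=Verbosity.verbose logs every drawn parameter combination as "Trying example: ...", and each failing draw re-prints the test source and traceback. A standalone illustration of the same mechanism (check_grid and its body are made up for this example):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # With verbose verbosity, each drawn (T, D) pair is printed as
        # "Trying example: check_grid(T=..., D=...)" before the body runs.
        assert T >= 1 and D in (5120, 7168)

    check_grid()  # invoking the decorated function runs all drawn examples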
2025-05-07T20:31:49.4722498Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4751167Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4767419Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4786482Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
2025-05-07T20:31:49.4802779Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[identical test source and traceback elided; the reference path fails with the same CompilationError in _kernel_quantize_fp8_row: fp8e4nv not supported]
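Since the error message itself names the workable alternatives ('fp8e4b15', 'fp8e5'), one pragmatic response on pre-sm_89 hardware is to fall back to the e5m2 format, which fp8e5 corresponds to on the PyTorch side. A hedged sketch (pick_fp8_dtype is hypothetical; whether e5m2's smaller mantissa is acceptable depends on the kernel's accuracy budget):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # e4m3 (fp8e4nv) needs native support on sm_89+; e5m2 (fp8e5) is the
        # alternative Triton reports as supported on this A10G (sm_86) runner.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2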
2025-05-07T20:31:49.4815179Z 2025-05-07T20:31:49.4815383Z self = 2025-05-07T20:31:49.4816157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4816817Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa049af790>} 2025-05-07T20:31:49.4817560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4817752Z context = 2025-05-07T20:31:49.4817761Z 2025-05-07T20:31:49.4817924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4818187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4818292Z module_map=module_map) 2025-05-07T20:31:49.4818457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4818562Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.4818637Z E ^ 2025-05-07T20:31:49.4818995Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4819002Z 2025-05-07T20:31:49.4819411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4819415Z 2025-05-07T20:31:49.4819516Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4819743Z self=, 2025-05-07T20:31:49.4819814Z T=1, 2025-05-07T20:31:49.4819886Z D=5120, 2025-05-07T20:31:49.4819970Z scale_ub=1200.0, 2025-05-07T20:31:49.4820051Z contiguous=True, 2025-05-07T20:31:49.4820132Z compiled=True, 2025-05-07T20:31:49.4820205Z ) 2025-05-07T20:31:49.4820416Z self = 2025-05-07T20:31:49.4820588Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.4820592Z 2025-05-07T20:31:49.4820664Z @given( 2025-05-07T20:31:49.4820781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4820882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4820991Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4821102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4821219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4821289Z ) 2025-05-07T20:31:49.4821536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4821630Z def test_silu_mul_quant( 2025-05-07T20:31:49.4821702Z self, 2025-05-07T20:31:49.4821779Z T: int, 2025-05-07T20:31:49.4821852Z D: int, 2025-05-07T20:31:49.4821944Z scale_ub: Optional[float], 2025-05-07T20:31:49.4822035Z contiguous: bool, 2025-05-07T20:31:49.4822114Z compiled: bool, 2025-05-07T20:31:49.4822187Z ) -> None: 2025-05-07T20:31:49.4822280Z torch.manual_seed(2025) 2025-05-07T20:31:49.4822348Z 2025-05-07T20:31:49.4822511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4822588Z 2025-05-07T20:31:49.4822674Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4822792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4822877Z x = x_sign * x_clamp 2025-05-07T20:31:49.4822951Z x0 = x[:, :D] 2025-05-07T20:31:49.4823027Z x1 = x[:, D:] 2025-05-07T20:31:49.4823178Z 2025-05-07T20:31:49.4823260Z if contiguous: 2025-05-07T20:31:49.4823349Z x0 = x0.contiguous() 2025-05-07T20:31:49.4823434Z x1 = x1.contiguous() 2025-05-07T20:31:49.4823503Z 2025-05-07T20:31:49.4823596Z if scale_ub is not None: 2025-05-07T20:31:49.4823792Z scale_ub_tensor = 
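Every failure in this run has the same root cause: the job runs on a g5.4xlarge runner, whose NVIDIA A10G GPU is an sm_86 (Ampere) part, while Triton only lowers the fp8e4nv (FP8 E4M3) element type on sm_89-class and newer GPUs; on sm_86 only fp8e4b15 and fp8e5 are available, exactly as the repeated ValueError states. A minimal sketch of a capability guard that would skip these cases on unsupported hardware (hypothetical names; no such guard appears in the test suite shown in this log):

    # Hypothetical guard, not part of moe/activation_test.py.
    # fp8e4nv maps to FP8 E4M3, which Triton supports only on sm_89+;
    # the A10G on this runner reports compute capability (8, 6).
    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor) compute capability of the current device.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantFP8Tests(unittest.TestCase):
        ...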
Trying examples: each of the following fails identically, with
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
raised from triton/compiler/compiler.py:100 while compiling the kernel named below:

    T=1,   D=5120, scale_ub=1200.0, contiguous=True,  compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn() (moe/activation_test.py:126) compiling _kernel_quantize_fp8_row
    T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False  -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> fn()     (moe/activation_test.py:117) compiling _fbgemm_silu_mul_quant
    T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True   -> ref_fn() (moe/activation_test.py:126) compiling _kernel_quantize_fp8_row
2025-05-07T20:31:49.4965811Z op = torch.compile(op) 2025-05-07T20:31:49.4965913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4965985Z 2025-05-07T20:31:49.4966071Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4966075Z 2025-05-07T20:31:49.4966168Z moe/activation_test.py:117: 2025-05-07T20:31:49.4966293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4966391Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4966486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4966861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.4966948Z return fn(*args, **kwargs) 2025-05-07T20:31:49.4967441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4967537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4967889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4968112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4968447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4968536Z kernel = self.compile( 2025-05-07T20:31:49.4968914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4969090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4969216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4969220Z 2025-05-07T20:31:49.4969424Z self = 2025-05-07T20:31:49.4970204Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4970710Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa03105e50>} 2025-05-07T20:31:49.4971460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4971653Z context = 2025-05-07T20:31:49.4971658Z 2025-05-07T20:31:49.4971819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4972083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4972190Z module_map=module_map) 2025-05-07T20:31:49.4972349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4972445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4972518Z E ^ 2025-05-07T20:31:49.4972870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4972875Z 2025-05-07T20:31:49.4973285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4973290Z 2025-05-07T20:31:49.4973467Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4973693Z self=, 2025-05-07T20:31:49.4973768Z T=1, 2025-05-07T20:31:49.4973841Z D=5120, 2025-05-07T20:31:49.4973923Z scale_ub=1200.0, 2025-05-07T20:31:49.4974132Z contiguous=False, 2025-05-07T20:31:49.4974210Z compiled=False, 2025-05-07T20:31:49.4974284Z ) 2025-05-07T20:31:49.4974498Z self = 2025-05-07T20:31:49.4974660Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.4974665Z 2025-05-07T20:31:49.4974742Z @given( 2025-05-07T20:31:49.4974856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4974952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4975062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4975172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4975289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4975360Z ) 2025-05-07T20:31:49.4975603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4975696Z def test_silu_mul_quant( 2025-05-07T20:31:49.4975769Z self, 2025-05-07T20:31:49.4975844Z T: int, 2025-05-07T20:31:49.4975921Z D: int, 2025-05-07T20:31:49.4976018Z scale_ub: Optional[float], 2025-05-07T20:31:49.4976103Z contiguous: bool, 2025-05-07T20:31:49.4976181Z compiled: bool, 2025-05-07T20:31:49.4976253Z ) -> None: 2025-05-07T20:31:49.4976346Z torch.manual_seed(2025) 2025-05-07T20:31:49.4976415Z 2025-05-07T20:31:49.4976577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4976649Z 2025-05-07T20:31:49.4976734Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4976852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4976942Z x = x_sign * x_clamp 2025-05-07T20:31:49.4977020Z x0 = x[:, :D] 2025-05-07T20:31:49.4977093Z x1 = x[:, D:] 2025-05-07T20:31:49.4977165Z 2025-05-07T20:31:49.4977244Z if contiguous: 2025-05-07T20:31:49.4977332Z x0 = x0.contiguous() 2025-05-07T20:31:49.4977418Z x1 = x1.contiguous() 2025-05-07T20:31:49.4977490Z 2025-05-07T20:31:49.4977580Z if scale_ub is not None: 2025-05-07T20:31:49.4977678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4977808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4977884Z ) 2025-05-07T20:31:49.4977958Z else: 2025-05-07T20:31:49.4978052Z scale_ub_tensor = None 2025-05-07T20:31:49.4978123Z 2025-05-07T20:31:49.4978246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4978331Z op = silu_mul_quant 2025-05-07T20:31:49.4978414Z if compiled: 2025-05-07T20:31:49.4978516Z op = torch.compile(op) 2025-05-07T20:31:49.4978616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4978691Z 2025-05-07T20:31:49.4978780Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4978784Z 2025-05-07T20:31:49.4978877Z moe/activation_test.py:117: 2025-05-07T20:31:49.4979005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4979110Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4979209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4979709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4979814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4980172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4980401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4985365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4985474Z kernel = self.compile( 2025-05-07T20:31:49.4985870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4986143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4986271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4986276Z 2025-05-07T20:31:49.4986479Z self = 2025-05-07T20:31:49.4987255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.4987770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f22820>} 2025-05-07T20:31:49.4988511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.4988710Z context = 2025-05-07T20:31:49.4988715Z 2025-05-07T20:31:49.4988875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.4989135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.4989241Z module_map=module_map) 2025-05-07T20:31:49.4989401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.4989501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.4989574Z E ^ 2025-05-07T20:31:49.4989991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.4989997Z 2025-05-07T20:31:49.4990410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.4990420Z 2025-05-07T20:31:49.4990518Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.4990742Z self=, 2025-05-07T20:31:49.4990816Z T=16384, 2025-05-07T20:31:49.4990889Z D=5120, 2025-05-07T20:31:49.4990969Z scale_ub=1200.0, 2025-05-07T20:31:49.4991049Z contiguous=False, 2025-05-07T20:31:49.4991124Z compiled=True, 2025-05-07T20:31:49.4991195Z ) 2025-05-07T20:31:49.4991410Z self = 2025-05-07T20:31:49.4991585Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.4991589Z 2025-05-07T20:31:49.4991672Z @given( 2025-05-07T20:31:49.4991788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.4991886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.4991999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.4992111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.4992225Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.4992295Z ) 2025-05-07T20:31:49.4992538Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.4992628Z def test_silu_mul_quant( 2025-05-07T20:31:49.4992702Z self, 2025-05-07T20:31:49.4992771Z T: int, 2025-05-07T20:31:49.4992846Z D: int, 2025-05-07T20:31:49.4992941Z scale_ub: Optional[float], 2025-05-07T20:31:49.4993022Z contiguous: bool, 2025-05-07T20:31:49.4993107Z compiled: bool, 2025-05-07T20:31:49.4993183Z ) -> None: 2025-05-07T20:31:49.4993363Z torch.manual_seed(2025) 2025-05-07T20:31:49.4993456Z 2025-05-07T20:31:49.4993645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.4993716Z 2025-05-07T20:31:49.4993803Z x_sign = torch.sign(x) 2025-05-07T20:31:49.4993924Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.4994088Z x = x_sign * x_clamp 2025-05-07T20:31:49.4994162Z x0 = x[:, :D] 2025-05-07T20:31:49.4994236Z x1 = x[:, D:] 2025-05-07T20:31:49.4994307Z 2025-05-07T20:31:49.4994387Z if contiguous: 2025-05-07T20:31:49.4994475Z x0 = x0.contiguous() 2025-05-07T20:31:49.4994561Z x1 = x1.contiguous() 2025-05-07T20:31:49.4994632Z 2025-05-07T20:31:49.4994719Z if scale_ub is not None: 2025-05-07T20:31:49.4994823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.4994955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.4995032Z ) 2025-05-07T20:31:49.4995108Z else: 2025-05-07T20:31:49.4995197Z scale_ub_tensor = None 2025-05-07T20:31:49.4995271Z 2025-05-07T20:31:49.4995395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.4995482Z op = silu_mul_quant 2025-05-07T20:31:49.4995566Z if compiled: 2025-05-07T20:31:49.4995666Z op = torch.compile(op) 2025-05-07T20:31:49.4995769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4995840Z 2025-05-07T20:31:49.4995926Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.4995931Z 2025-05-07T20:31:49.4996024Z moe/activation_test.py:117: 2025-05-07T20:31:49.4996152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4996250Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.4996349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.4996716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.4996809Z return fn(*args, **kwargs) 
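Every failure in this run is the same CompilationError: Triton refuses to lower the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on this GPU. fp8e4nv conversion support in Triton requires compute capability 8.9 or newer (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G, which reports capability (8, 6) — hence only fp8e4b15 and fp8e5 are offered, exactly as the error message says. A minimal sketch of a capability guard such a test suite could use (supports_fp8e4nv and requires_fp8 are hypothetical names, not FBGEMM APIs):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs SM 8.9+;
        # the A10G on a g5 instance reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests like test_silu_mul_quant:
    requires_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )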
2025-05-07T20:31:49.4997307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.4997400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.4997754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.4997984Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.4998316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.4998409Z kernel = self.compile( 2025-05-07T20:31:49.4998783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.4998953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.4999081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.4999085Z 2025-05-07T20:31:49.4999289Z self = 2025-05-07T20:31:49.5000068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5000579Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0295d790>} 2025-05-07T20:31:49.5001328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5001517Z context = 2025-05-07T20:31:49.5001604Z 2025-05-07T20:31:49.5001769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5002033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5002137Z module_map=module_map) 2025-05-07T20:31:49.5002373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5002469Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5002545Z E ^ 2025-05-07T20:31:49.5002902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5002907Z 2025-05-07T20:31:49.5003315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5003320Z 2025-05-07T20:31:49.5003417Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5003644Z self=, 2025-05-07T20:31:49.5003925Z T=2048, 2025-05-07T20:31:49.5004042Z D=7168, 2025-05-07T20:31:49.5004154Z scale_ub=1200.0, 2025-05-07T20:31:49.5004270Z contiguous=False, 2025-05-07T20:31:49.5004390Z compiled=True, 2025-05-07T20:31:49.5004486Z ) 2025-05-07T20:31:49.5004771Z self = 2025-05-07T20:31:49.5004956Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5004961Z 2025-05-07T20:31:49.5005035Z @given( 2025-05-07T20:31:49.5005150Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5005246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5005355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5005471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5005580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5005650Z ) 2025-05-07T20:31:49.5005900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5005987Z def test_silu_mul_quant( 2025-05-07T20:31:49.5006060Z self, 2025-05-07T20:31:49.5006137Z T: int, 2025-05-07T20:31:49.5006212Z D: int, 2025-05-07T20:31:49.5006309Z scale_ub: Optional[float], 2025-05-07T20:31:49.5006397Z contiguous: bool, 2025-05-07T20:31:49.5006477Z compiled: bool, 2025-05-07T20:31:49.5006555Z ) -> None: 2025-05-07T20:31:49.5006645Z torch.manual_seed(2025) 2025-05-07T20:31:49.5006714Z 2025-05-07T20:31:49.5006884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5006954Z 2025-05-07T20:31:49.5007039Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5007163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5007247Z x = x_sign * x_clamp 2025-05-07T20:31:49.5007322Z x0 = x[:, :D] 2025-05-07T20:31:49.5007404Z x1 = x[:, D:] 2025-05-07T20:31:49.5007476Z 2025-05-07T20:31:49.5007554Z if contiguous: 2025-05-07T20:31:49.5007643Z x0 = x0.contiguous() 2025-05-07T20:31:49.5007731Z x1 = x1.contiguous() 2025-05-07T20:31:49.5007800Z 2025-05-07T20:31:49.5007889Z if scale_ub is not None: 2025-05-07T20:31:49.5007993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5008126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5008198Z ) 2025-05-07T20:31:49.5008270Z else: 2025-05-07T20:31:49.5008364Z scale_ub_tensor = None 2025-05-07T20:31:49.5008432Z 2025-05-07T20:31:49.5008556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5008645Z op = silu_mul_quant 2025-05-07T20:31:49.5008726Z if compiled: 2025-05-07T20:31:49.5008819Z op = torch.compile(op) 2025-05-07T20:31:49.5008926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5009151Z 2025-05-07T20:31:49.5009243Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5009248Z 2025-05-07T20:31:49.5009342Z moe/activation_test.py:117: 2025-05-07T20:31:49.5009466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5009565Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5009771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5010142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5010231Z return fn(*args, **kwargs) 
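For context on what the failing test checks: silu_mul_quant fuses a SiLU-gated multiply, y = x0 * sigmoid(x0) * x1, with rowwise FP8 quantization, which is what the test's ref_fn reproduces via triton_quantize_fp8_row. A pure-PyTorch sketch of that reference, under the assumption that rowwise quantization picks one scale per row so the row maximum maps to the float8_e4m3fn limit of 448 (the exact clamping inside triton_quantize_fp8_row may differ):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        fp8_max: float = 448.0,  # float8_e4m3fn max finite value
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, matching the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row so that max(|row|) maps onto fp8_max.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale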
2025-05-07T20:31:49.5010720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5010815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5011166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5011397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5011733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5011822Z kernel = self.compile( 2025-05-07T20:31:49.5012197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5012377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5012499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5012504Z 2025-05-07T20:31:49.5012709Z self = 2025-05-07T20:31:49.5013533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5014044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a404c0>} 2025-05-07T20:31:49.5014781Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5014974Z context = 2025-05-07T20:31:49.5014979Z 2025-05-07T20:31:49.5015144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5015402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5015505Z module_map=module_map) 2025-05-07T20:31:49.5015663Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5015757Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5015838Z E ^ 2025-05-07T20:31:49.5016188Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5016193Z 2025-05-07T20:31:49.5016602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5016615Z 2025-05-07T20:31:49.5016714Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5016931Z self=, 2025-05-07T20:31:49.5017007Z T=1, 2025-05-07T20:31:49.5017079Z D=5120, 2025-05-07T20:31:49.5017154Z scale_ub=None, 2025-05-07T20:31:49.5017240Z contiguous=False, 2025-05-07T20:31:49.5017319Z compiled=False, 2025-05-07T20:31:49.5017389Z ) 2025-05-07T20:31:49.5017603Z self = 2025-05-07T20:31:49.5017875Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5017880Z 2025-05-07T20:31:49.5017954Z @given( 2025-05-07T20:31:49.5018077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5018170Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5018285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5018473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5018583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5018658Z ) 2025-05-07T20:31:49.5018897Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5018988Z def test_silu_mul_quant( 2025-05-07T20:31:49.5019061Z self, 2025-05-07T20:31:49.5019133Z T: int, 2025-05-07T20:31:49.5019204Z D: int, 2025-05-07T20:31:49.5019299Z scale_ub: Optional[float], 2025-05-07T20:31:49.5019384Z contiguous: bool, 2025-05-07T20:31:49.5019467Z compiled: bool, 2025-05-07T20:31:49.5019540Z ) -> None: 2025-05-07T20:31:49.5019632Z torch.manual_seed(2025) 2025-05-07T20:31:49.5019703Z 2025-05-07T20:31:49.5019869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5019940Z 2025-05-07T20:31:49.5020030Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5020153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5020239Z x = x_sign * x_clamp 2025-05-07T20:31:49.5020318Z x0 = x[:, :D] 2025-05-07T20:31:49.5020393Z x1 = x[:, D:] 2025-05-07T20:31:49.5020461Z 2025-05-07T20:31:49.5020541Z if contiguous: 2025-05-07T20:31:49.5020626Z x0 = x0.contiguous() 2025-05-07T20:31:49.5020710Z x1 = x1.contiguous() 2025-05-07T20:31:49.5020778Z 2025-05-07T20:31:49.5020866Z if scale_ub is not None: 2025-05-07T20:31:49.5020970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5021102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5021182Z ) 2025-05-07T20:31:49.5021260Z else: 2025-05-07T20:31:49.5021347Z scale_ub_tensor = None 2025-05-07T20:31:49.5021415Z 2025-05-07T20:31:49.5021542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5021627Z op = silu_mul_quant 2025-05-07T20:31:49.5021712Z if compiled: 2025-05-07T20:31:49.5021813Z op = torch.compile(op) 2025-05-07T20:31:49.5021915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5021984Z 2025-05-07T20:31:49.5022074Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5022078Z 2025-05-07T20:31:49.5022170Z moe/activation_test.py:117: 2025-05-07T20:31:49.5022300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5022396Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5022493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5022999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5023092Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5023476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5023725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5024062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5024154Z kernel = self.compile( 2025-05-07T20:31:49.5024529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5024701Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5024832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5024836Z 2025-05-07T20:31:49.5025124Z self = 2025-05-07T20:31:49.5025900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5026476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a40820>} 2025-05-07T20:31:49.5027216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5027413Z context = 2025-05-07T20:31:49.5027417Z 2025-05-07T20:31:49.5027580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5027848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5027948Z module_map=module_map) 2025-05-07T20:31:49.5028105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5028200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5028280Z E ^ 2025-05-07T20:31:49.5028634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5028639Z 2025-05-07T20:31:49.5029044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5029049Z 2025-05-07T20:31:49.5029148Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5029369Z self=, 2025-05-07T20:31:49.5029442Z T=4096, 2025-05-07T20:31:49.5029513Z D=7168, 2025-05-07T20:31:49.5029602Z scale_ub=1200.0, 2025-05-07T20:31:49.5029682Z contiguous=False, 2025-05-07T20:31:49.5029765Z compiled=False, 2025-05-07T20:31:49.5029892Z ) 2025-05-07T20:31:49.5030107Z self = 2025-05-07T20:31:49.5030281Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5030291Z 2025-05-07T20:31:49.5030367Z @given( 2025-05-07T20:31:49.5030483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5030585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5030697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5030809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5030925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5030995Z ) 2025-05-07T20:31:49.5031236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5031324Z def test_silu_mul_quant( 2025-05-07T20:31:49.5031400Z self, 2025-05-07T20:31:49.5031476Z T: int, 2025-05-07T20:31:49.5031547Z D: int, 2025-05-07T20:31:49.5031640Z scale_ub: Optional[float], 2025-05-07T20:31:49.5031729Z contiguous: bool, 2025-05-07T20:31:49.5031809Z compiled: bool, 2025-05-07T20:31:49.5031888Z ) -> None: 2025-05-07T20:31:49.5031981Z torch.manual_seed(2025) 2025-05-07T20:31:49.5032052Z 2025-05-07T20:31:49.5032216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5032289Z 2025-05-07T20:31:49.5032374Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5032498Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5032581Z x = x_sign * x_clamp 2025-05-07T20:31:49.5032654Z x0 = x[:, :D] 2025-05-07T20:31:49.5032733Z x1 = x[:, D:] 2025-05-07T20:31:49.5032801Z 2025-05-07T20:31:49.5032877Z if contiguous: 2025-05-07T20:31:49.5033053Z x0 = x0.contiguous() 2025-05-07T20:31:49.5033141Z x1 = x1.contiguous() 2025-05-07T20:31:49.5033208Z 2025-05-07T20:31:49.5033300Z if scale_ub is not None: 2025-05-07T20:31:49.5033402Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5033531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5033680Z ) 2025-05-07T20:31:49.5033751Z else: 2025-05-07T20:31:49.5033840Z scale_ub_tensor = None 2025-05-07T20:31:49.5033911Z 2025-05-07T20:31:49.5034036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5034124Z op = silu_mul_quant 2025-05-07T20:31:49.5034208Z if compiled: 2025-05-07T20:31:49.5034307Z op = torch.compile(op) 2025-05-07T20:31:49.5034412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5034480Z 2025-05-07T20:31:49.5034564Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5034568Z 2025-05-07T20:31:49.5034668Z moe/activation_test.py:117: 2025-05-07T20:31:49.5034792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5034888Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5034982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5035481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5035583Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5035937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5036154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5036493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5036582Z kernel = self.compile( 2025-05-07T20:31:49.5036967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5037137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5037258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5037262Z 2025-05-07T20:31:49.5037467Z self = 2025-05-07T20:31:49.5038244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5038753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02f8faf0>} 2025-05-07T20:31:49.5039499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5039688Z context = 2025-05-07T20:31:49.5039693Z 2025-05-07T20:31:49.5039858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5040122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5040225Z module_map=module_map) 2025-05-07T20:31:49.5040382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5040473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5040546Z E ^ 2025-05-07T20:31:49.5040899Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5040903Z 2025-05-07T20:31:49.5041391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5041401Z 2025-05-07T20:31:49.5041499Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5041718Z self=, 2025-05-07T20:31:49.5041799Z T=16384, 2025-05-07T20:31:49.5041870Z D=7168, 2025-05-07T20:31:49.5042022Z scale_ub=None, 2025-05-07T20:31:49.5042107Z contiguous=True, 2025-05-07T20:31:49.5042185Z compiled=True, 2025-05-07T20:31:49.5042254Z ) 2025-05-07T20:31:49.5042474Z self = 2025-05-07T20:31:49.5042645Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.5042650Z 2025-05-07T20:31:49.5042727Z @given( 2025-05-07T20:31:49.5042842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5042940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5043055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5043195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5043324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5043403Z ) 2025-05-07T20:31:49.5043644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5043733Z def test_silu_mul_quant( 2025-05-07T20:31:49.5043813Z self, 2025-05-07T20:31:49.5043886Z T: int, 2025-05-07T20:31:49.5043959Z D: int, 2025-05-07T20:31:49.5044056Z scale_ub: Optional[float], 2025-05-07T20:31:49.5044139Z contiguous: bool, 2025-05-07T20:31:49.5044222Z compiled: bool, 2025-05-07T20:31:49.5044298Z ) -> None: 2025-05-07T20:31:49.5044387Z torch.manual_seed(2025) 2025-05-07T20:31:49.5044461Z 2025-05-07T20:31:49.5044625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5044693Z 2025-05-07T20:31:49.5044782Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5044905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5044989Z x = x_sign * x_clamp 2025-05-07T20:31:49.5045073Z x0 = x[:, :D] 2025-05-07T20:31:49.5045147Z x1 = x[:, D:] 2025-05-07T20:31:49.5045216Z 2025-05-07T20:31:49.5045299Z if contiguous: 2025-05-07T20:31:49.5045386Z x0 = x0.contiguous() 2025-05-07T20:31:49.5045478Z x1 = x1.contiguous() 2025-05-07T20:31:49.5045547Z 2025-05-07T20:31:49.5045634Z if scale_ub is not None: 2025-05-07T20:31:49.5045736Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5045867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5045940Z ) 2025-05-07T20:31:49.5046016Z else: 2025-05-07T20:31:49.5046107Z scale_ub_tensor = None 2025-05-07T20:31:49.5046176Z 2025-05-07T20:31:49.5046304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5046389Z op = silu_mul_quant 2025-05-07T20:31:49.5046472Z if compiled: 2025-05-07T20:31:49.5046575Z op = torch.compile(op) 2025-05-07T20:31:49.5046677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5046748Z 2025-05-07T20:31:49.5046831Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5046836Z 2025-05-07T20:31:49.5046931Z moe/activation_test.py:117: 2025-05-07T20:31:49.5047064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5047160Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5047253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5047618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5047707Z return fn(*args, **kwargs) 
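The "Trying example: test_silu_mul_quant(...)" lines are Hypothesis's verbose reporting: @given draws each parameter from st.sampled_from, and @settings(verbosity=Verbosity.verbose) prints every drawn example, so one underlying failure is reported once per sampled combination of (T, D, scale_ub, contiguous, compiled). A self-contained sketch of the same mechanism (toy test, hypothetical names):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    def test_toy(T: int, compiled: bool) -> None:
        # Verbose mode logs "Trying example: test_toy(T=..., compiled=...)"
        # for every draw, which is the pattern filling this log.
        assert T >= 1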
2025-05-07T20:31:49.5048201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5048293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5048753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5048977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5049309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5049470Z kernel = self.compile( 2025-05-07T20:31:49.5049849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5050019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5050143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5050147Z 2025-05-07T20:31:49.5050351Z self = 2025-05-07T20:31:49.5051126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5051627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02e67790>} 2025-05-07T20:31:49.5052370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5052561Z context = 2025-05-07T20:31:49.5052565Z 2025-05-07T20:31:49.5052725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5052982Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5053093Z module_map=module_map) 2025-05-07T20:31:49.5053253Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5053366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5053447Z E ^ 2025-05-07T20:31:49.5053822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5053831Z 2025-05-07T20:31:49.5054245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5054249Z 2025-05-07T20:31:49.5054348Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5054570Z self=, 2025-05-07T20:31:49.5054640Z T=4096, 2025-05-07T20:31:49.5054710Z D=5120, 2025-05-07T20:31:49.5054791Z scale_ub=None, 2025-05-07T20:31:49.5054872Z contiguous=False, 2025-05-07T20:31:49.5054950Z compiled=True, 2025-05-07T20:31:49.5055023Z ) 2025-05-07T20:31:49.5055239Z self = 2025-05-07T20:31:49.5055407Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5055412Z 2025-05-07T20:31:49.5055487Z @given( 2025-05-07T20:31:49.5055602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5055706Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5055817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5055932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5056044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5056114Z ) 2025-05-07T20:31:49.5056356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5056446Z def test_silu_mul_quant( 2025-05-07T20:31:49.5056516Z self, 2025-05-07T20:31:49.5056586Z T: int, 2025-05-07T20:31:49.5056657Z D: int, 2025-05-07T20:31:49.5056839Z scale_ub: Optional[float], 2025-05-07T20:31:49.5056926Z contiguous: bool, 2025-05-07T20:31:49.5057007Z compiled: bool, 2025-05-07T20:31:49.5057079Z ) -> None: 2025-05-07T20:31:49.5057173Z torch.manual_seed(2025) 2025-05-07T20:31:49.5057238Z 2025-05-07T20:31:49.5057402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5057549Z 2025-05-07T20:31:49.5057636Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5057758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5057844Z x = x_sign * x_clamp 2025-05-07T20:31:49.5057919Z x0 = x[:, :D] 2025-05-07T20:31:49.5057993Z x1 = x[:, D:] 2025-05-07T20:31:49.5058067Z 2025-05-07T20:31:49.5058145Z if contiguous: 2025-05-07T20:31:49.5058232Z x0 = x0.contiguous() 2025-05-07T20:31:49.5058320Z x1 = x1.contiguous() 2025-05-07T20:31:49.5058386Z 2025-05-07T20:31:49.5058482Z if scale_ub is not None: 2025-05-07T20:31:49.5058584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5058714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5058791Z ) 2025-05-07T20:31:49.5058863Z else: 2025-05-07T20:31:49.5058951Z scale_ub_tensor = None 2025-05-07T20:31:49.5059028Z 2025-05-07T20:31:49.5059152Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5059236Z op = silu_mul_quant 2025-05-07T20:31:49.5059321Z if compiled: 2025-05-07T20:31:49.5059415Z op = torch.compile(op) 2025-05-07T20:31:49.5059516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5059588Z 2025-05-07T20:31:49.5059673Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5059677Z 2025-05-07T20:31:49.5059773Z moe/activation_test.py:117: 2025-05-07T20:31:49.5059896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5059997Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5060097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5060457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5060548Z return fn(*args, **kwargs) 
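The deepest frames differ between the two failing paths: the ref_fn path goes through triton/runtime/autotuner.py, where each candidate config is benchmarked and therefore compiled, while the direct silu_mul_quant path reaches jit.py and compiler.py straight away; either way src.make_ir raises before any kernel runs. A minimal sketch of the decorator stack that produces the autotuner frames (toy kernel, hypothetical names):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 1024}, num_warps=4),
            triton.Config({"BLOCK": 2048}, num_warps=8),
        ],
        key=["N"],
    )
    @triton.jit
    def _toy_scale_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        # Each config triggers its own compile inside _bench(), so an
        # unsupported dtype fails during autotuning itself.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)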
2025-05-07T20:31:49.5061044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5061138Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5061491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5061708Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5062044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5062135Z kernel = self.compile( 2025-05-07T20:31:49.5062517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5062687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5062808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5062817Z 2025-05-07T20:31:49.5063020Z self = 2025-05-07T20:31:49.5063791Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5064296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02b5d550>} 2025-05-07T20:31:49.5065116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5065306Z context = 2025-05-07T20:31:49.5065311Z 2025-05-07T20:31:49.5065476Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5065807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5065913Z module_map=module_map) 2025-05-07T20:31:49.5066073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5066166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5066245Z E ^ 2025-05-07T20:31:49.5066603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5066608Z 2025-05-07T20:31:49.5067024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5067034Z 2025-05-07T20:31:49.5067133Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5067351Z self=, 2025-05-07T20:31:49.5067429Z T=4096, 2025-05-07T20:31:49.5067506Z D=5120, 2025-05-07T20:31:49.5067584Z scale_ub=1200.0, 2025-05-07T20:31:49.5067668Z contiguous=False, 2025-05-07T20:31:49.5067745Z compiled=False, 2025-05-07T20:31:49.5067813Z ) 2025-05-07T20:31:49.5068031Z self = 2025-05-07T20:31:49.5068201Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5068205Z 2025-05-07T20:31:49.5068280Z @given( 2025-05-07T20:31:49.5068395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5068488Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5068607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5068720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5068828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5068900Z ) 2025-05-07T20:31:49.5069141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5069234Z def test_silu_mul_quant( 2025-05-07T20:31:49.5069306Z self, 2025-05-07T20:31:49.5069377Z T: int, 2025-05-07T20:31:49.5069448Z D: int, 2025-05-07T20:31:49.5069544Z scale_ub: Optional[float], 2025-05-07T20:31:49.5069629Z contiguous: bool, 2025-05-07T20:31:49.5069717Z compiled: bool, 2025-05-07T20:31:49.5069790Z ) -> None: 2025-05-07T20:31:49.5069942Z torch.manual_seed(2025) 2025-05-07T20:31:49.5070016Z 2025-05-07T20:31:49.5070182Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5070252Z 2025-05-07T20:31:49.5070347Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5070465Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5070552Z x = x_sign * x_clamp 2025-05-07T20:31:49.5070632Z x0 = x[:, :D] 2025-05-07T20:31:49.5070705Z x1 = x[:, D:] 2025-05-07T20:31:49.5070775Z 2025-05-07T20:31:49.5070854Z if contiguous: 2025-05-07T20:31:49.5070945Z x0 = x0.contiguous() 2025-05-07T20:31:49.5071032Z x1 = x1.contiguous() 2025-05-07T20:31:49.5071101Z 2025-05-07T20:31:49.5071187Z if scale_ub is not None: 2025-05-07T20:31:49.5071290Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5071419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5071492Z ) 2025-05-07T20:31:49.5071564Z else: 2025-05-07T20:31:49.5071653Z scale_ub_tensor = None 2025-05-07T20:31:49.5071722Z 2025-05-07T20:31:49.5071849Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5072020Z op = silu_mul_quant 2025-05-07T20:31:49.5072102Z if compiled: 2025-05-07T20:31:49.5072199Z op = torch.compile(op) 2025-05-07T20:31:49.5072301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5072372Z 2025-05-07T20:31:49.5072456Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5072555Z 2025-05-07T20:31:49.5072648Z moe/activation_test.py:117: 2025-05-07T20:31:49.5072773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5072871Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5072966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5073520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5073612Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5073966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5074189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5074523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5074614Z kernel = self.compile( 2025-05-07T20:31:49.5074996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5075165Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5075288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5075292Z 2025-05-07T20:31:49.5075494Z self = 2025-05-07T20:31:49.5076269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5076768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebd0d0>} 2025-05-07T20:31:49.5077506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5077699Z context = 2025-05-07T20:31:49.5077704Z 2025-05-07T20:31:49.5077864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5078123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5078223Z module_map=module_map) 2025-05-07T20:31:49.5078384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5078486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5078558Z E ^ 2025-05-07T20:31:49.5078913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5078918Z 2025-05-07T20:31:49.5079325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5079334Z 2025-05-07T20:31:49.5079430Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5079650Z self=, 2025-05-07T20:31:49.5079723Z T=4096, 2025-05-07T20:31:49.5079799Z D=5120, 2025-05-07T20:31:49.5079877Z scale_ub=1200.0, 2025-05-07T20:31:49.5079955Z contiguous=False, 2025-05-07T20:31:49.5080033Z compiled=True, 2025-05-07T20:31:49.5080102Z ) 2025-05-07T20:31:49.5080314Z self = 2025-05-07T20:31:49.5080638Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5080643Z 2025-05-07T20:31:49.5080718Z @given( 2025-05-07T20:31:49.5080832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5080932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5081043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5081234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5081343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5081413Z ) 2025-05-07T20:31:49.5081655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5081746Z def test_silu_mul_quant( 2025-05-07T20:31:49.5081816Z self, 2025-05-07T20:31:49.5081891Z T: int, 2025-05-07T20:31:49.5081963Z D: int, 2025-05-07T20:31:49.5082054Z scale_ub: Optional[float], 2025-05-07T20:31:49.5082141Z contiguous: bool, 2025-05-07T20:31:49.5082226Z compiled: bool, 2025-05-07T20:31:49.5082298Z ) -> None: 2025-05-07T20:31:49.5082392Z torch.manual_seed(2025) 2025-05-07T20:31:49.5082460Z 2025-05-07T20:31:49.5082626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5082700Z 2025-05-07T20:31:49.5082788Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5082914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5082998Z x = x_sign * x_clamp 2025-05-07T20:31:49.5083073Z x0 = x[:, :D] 2025-05-07T20:31:49.5083152Z x1 = x[:, D:] 2025-05-07T20:31:49.5083221Z 2025-05-07T20:31:49.5083317Z if contiguous: 2025-05-07T20:31:49.5083413Z x0 = x0.contiguous() 2025-05-07T20:31:49.5083518Z x1 = x1.contiguous() 2025-05-07T20:31:49.5083587Z 2025-05-07T20:31:49.5083676Z if scale_ub is not None: 2025-05-07T20:31:49.5083777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5083911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5083985Z ) 2025-05-07T20:31:49.5084054Z else: 2025-05-07T20:31:49.5084145Z scale_ub_tensor = None 2025-05-07T20:31:49.5084213Z 2025-05-07T20:31:49.5084338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5084431Z op = silu_mul_quant 2025-05-07T20:31:49.5084510Z if compiled: 2025-05-07T20:31:49.5084604Z op = torch.compile(op) 2025-05-07T20:31:49.5084708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5084774Z 2025-05-07T20:31:49.5084862Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5084866Z 2025-05-07T20:31:49.5084960Z moe/activation_test.py:117: 2025-05-07T20:31:49.5085085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5085185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5085280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5085645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5085735Z return fn(*args, **kwargs) 
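Note that compiled=True and compiled=False examples fail identically: torch.compile only wraps the call (the _dynamo/eval_frame.py frame above re-enters the original function), and the Triton kernel is still launched through the same jit/compile path, so the architecture check trips in both modes. A minimal sketch of that wrapping (toy op, not the FBGEMM one):

    import torch

    def silu(x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x)

    compiled_silu = torch.compile(silu)

    x = torch.randn(8, device="cuda", dtype=torch.bfloat16)
    # Eager and compiled execution reach the same device kernels; a dtype
    # the GPU cannot lower fails in both modes, as this log shows.
    y_eager = silu(x)
    y_compiled = compiled_silu(x)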
2025-05-07T20:31:49.5086223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5086320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5086674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5086892Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5087228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5087317Z kernel = self.compile( 2025-05-07T20:31:49.5087693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5087949Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5088073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5088078Z 2025-05-07T20:31:49.5088278Z self = 2025-05-07T20:31:49.5089128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5089627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ebddc0>} 2025-05-07T20:31:49.5090369Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5090564Z context = 2025-05-07T20:31:49.5090568Z 2025-05-07T20:31:49.5090730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5090988Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5091096Z module_map=module_map) 2025-05-07T20:31:49.5091266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5091358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5091433Z E ^ 2025-05-07T20:31:49.5091787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:49.5092310Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
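The failure is environmental rather than numerical: the Triton kernel requests the fp8e4nv dtype, and this runner's GPU does not provide it, so compilation of _fbgemm_silu_mul_quant aborts before any example can run. The g5 runner carries an A10G (sm_86, inferred from the instance type), and fp8e4nv appears to be Triton's NVIDIA e4m3 format, which is only enabled on newer compute capabilities; on this card Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard, assuming one wanted these tests to skip rather than fail on such GPUs (the (8, 9) threshold, supports_fp8e4nv, and the skipif wiring are assumptions, not taken from this log or from the FBGEMM test suite):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton enables fp8e4nv (e4m3) from compute
        # capability 8.9 (Ada) / 9.0 (Hopper) onward; the A10G on this
        # runner reports (8, 6) and therefore lacks it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard for the failing test:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="GPU lacks fp8e4nv; Triton supports only ('fp8e4b15', 'fp8e5') here",
    )

With a guard like this the suite would report one skip per example instead of repeating the identical CompilationError for every Hypothesis draw.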
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5108916Z 2025-05-07T20:31:49.5109336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5109341Z 2025-05-07T20:31:49.5109578Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5109798Z self=, 2025-05-07T20:31:49.5109931Z T=1, 2025-05-07T20:31:49.5110008Z D=7168, 2025-05-07T20:31:49.5110086Z scale_ub=None, 2025-05-07T20:31:49.5110165Z contiguous=True, 2025-05-07T20:31:49.5110248Z compiled=False, 2025-05-07T20:31:49.5110317Z ) 2025-05-07T20:31:49.5110530Z self = 2025-05-07T20:31:49.5110696Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5110701Z 2025-05-07T20:31:49.5110776Z @given( 2025-05-07T20:31:49.5110905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5111001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5111112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5111225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5111343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5111415Z ) 2025-05-07T20:31:49.5111661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5111748Z def test_silu_mul_quant( 2025-05-07T20:31:49.5111823Z self, 2025-05-07T20:31:49.5111893Z T: int, 2025-05-07T20:31:49.5111962Z D: int, 2025-05-07T20:31:49.5112057Z scale_ub: Optional[float], 2025-05-07T20:31:49.5112141Z contiguous: bool, 2025-05-07T20:31:49.5112222Z compiled: bool, 2025-05-07T20:31:49.5112301Z ) -> None: 2025-05-07T20:31:49.5112391Z torch.manual_seed(2025) 2025-05-07T20:31:49.5112465Z 2025-05-07T20:31:49.5112632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5112703Z 2025-05-07T20:31:49.5112791Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5112914Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5112999Z x = x_sign * x_clamp 2025-05-07T20:31:49.5113080Z x0 = x[:, :D] 2025-05-07T20:31:49.5113162Z x1 = x[:, D:] 2025-05-07T20:31:49.5113229Z 2025-05-07T20:31:49.5113322Z if contiguous: 2025-05-07T20:31:49.5113423Z x0 = x0.contiguous() 2025-05-07T20:31:49.5113519Z x1 = x1.contiguous() 2025-05-07T20:31:49.5113604Z 2025-05-07T20:31:49.5113692Z if scale_ub is not None: 2025-05-07T20:31:49.5113796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5113934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5114007Z ) 2025-05-07T20:31:49.5114079Z else: 2025-05-07T20:31:49.5114180Z scale_ub_tensor = None 2025-05-07T20:31:49.5114248Z 2025-05-07T20:31:49.5114374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5114464Z op = silu_mul_quant 2025-05-07T20:31:49.5114544Z if compiled: 2025-05-07T20:31:49.5114642Z op = torch.compile(op) 2025-05-07T20:31:49.5114749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5114820Z 2025-05-07T20:31:49.5114908Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5114912Z 2025-05-07T20:31:49.5115009Z moe/activation_test.py:117: 2025-05-07T20:31:49.5115132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5115232Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5115327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5115829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5116009Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5116367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5116592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5117003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5117090Z kernel = self.compile( 2025-05-07T20:31:49.5117473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5117644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5117769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5117774Z 2025-05-07T20:31:49.5117978Z self = 2025-05-07T20:31:49.5118762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5119268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0285c160>} 2025-05-07T20:31:49.5120018Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5120211Z context = 2025-05-07T20:31:49.5120215Z 2025-05-07T20:31:49.5120379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5120648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5120755Z module_map=module_map) 2025-05-07T20:31:49.5120916Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5121014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5121086Z E ^ 2025-05-07T20:31:49.5121438Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5121447Z 2025-05-07T20:31:49.5121857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5121862Z 2025-05-07T20:31:49.5121960Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5122180Z self=, 2025-05-07T20:31:49.5122254Z T=16384, 2025-05-07T20:31:49.5122326Z D=7168, 2025-05-07T20:31:49.5122408Z scale_ub=1200.0, 2025-05-07T20:31:49.5122491Z contiguous=False, 2025-05-07T20:31:49.5122573Z compiled=True, 2025-05-07T20:31:49.5122645Z ) 2025-05-07T20:31:49.5122857Z self = 2025-05-07T20:31:49.5123028Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5123033Z 2025-05-07T20:31:49.5123110Z @given( 2025-05-07T20:31:49.5123231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5123329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5123437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5123548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5123663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5123734Z ) 2025-05-07T20:31:49.5123973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5124063Z def test_silu_mul_quant( 2025-05-07T20:31:49.5124135Z self, 2025-05-07T20:31:49.5124206Z T: int, 2025-05-07T20:31:49.5124365Z D: int, 2025-05-07T20:31:49.5124462Z scale_ub: Optional[float], 2025-05-07T20:31:49.5124545Z contiguous: bool, 2025-05-07T20:31:49.5124626Z compiled: bool, 2025-05-07T20:31:49.5124699Z ) -> None: 2025-05-07T20:31:49.5124790Z torch.manual_seed(2025) 2025-05-07T20:31:49.5124934Z 2025-05-07T20:31:49.5125100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5125173Z 2025-05-07T20:31:49.5125259Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5125379Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5125466Z x = x_sign * x_clamp 2025-05-07T20:31:49.5125542Z x0 = x[:, :D] 2025-05-07T20:31:49.5125615Z x1 = x[:, D:] 2025-05-07T20:31:49.5125687Z 2025-05-07T20:31:49.5125766Z if contiguous: 2025-05-07T20:31:49.5125854Z x0 = x0.contiguous() 2025-05-07T20:31:49.5125938Z x1 = x1.contiguous() 2025-05-07T20:31:49.5126011Z 2025-05-07T20:31:49.5126099Z if scale_ub is not None: 2025-05-07T20:31:49.5126198Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5126329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5126406Z ) 2025-05-07T20:31:49.5126480Z else: 2025-05-07T20:31:49.5126574Z scale_ub_tensor = None 2025-05-07T20:31:49.5126643Z 2025-05-07T20:31:49.5126768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5126857Z op = silu_mul_quant 2025-05-07T20:31:49.5126938Z if compiled: 2025-05-07T20:31:49.5127033Z op = torch.compile(op) 2025-05-07T20:31:49.5127135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5127204Z 2025-05-07T20:31:49.5127287Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5127291Z 2025-05-07T20:31:49.5127384Z moe/activation_test.py:117: 2025-05-07T20:31:49.5127516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5127611Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5127711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5128072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5128169Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5128659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5128752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5129106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5129326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5129659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5129754Z kernel = self.compile( 2025-05-07T20:31:49.5130127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5130302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5130422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5130432Z 2025-05-07T20:31:49.5130635Z self = 2025-05-07T20:31:49.5131414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5131918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0285c4c0>} 2025-05-07T20:31:49.5132744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5132940Z context = 2025-05-07T20:31:49.5133018Z 2025-05-07T20:31:49.5133213Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5133498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5133602Z module_map=module_map) 2025-05-07T20:31:49.5133766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5133859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5133933Z E ^ 2025-05-07T20:31:49.5134292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5134297Z 2025-05-07T20:31:49.5134709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5134714Z 2025-05-07T20:31:49.5134815Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5135033Z self=, 2025-05-07T20:31:49.5135110Z T=1, 2025-05-07T20:31:49.5135181Z D=7168, 2025-05-07T20:31:49.5135257Z scale_ub=None, 2025-05-07T20:31:49.5135338Z contiguous=False, 2025-05-07T20:31:49.5135421Z compiled=False, 2025-05-07T20:31:49.5135490Z ) 2025-05-07T20:31:49.5135707Z self = 2025-05-07T20:31:49.5135874Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5135879Z 2025-05-07T20:31:49.5135952Z @given( 2025-05-07T20:31:49.5136071Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5136165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5136279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5136395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5136504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5136575Z ) 2025-05-07T20:31:49.5136818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5136910Z def test_silu_mul_quant( 2025-05-07T20:31:49.5136982Z self, 2025-05-07T20:31:49.5137058Z T: int, 2025-05-07T20:31:49.5137128Z D: int, 2025-05-07T20:31:49.5137224Z scale_ub: Optional[float], 2025-05-07T20:31:49.5137306Z contiguous: bool, 2025-05-07T20:31:49.5137385Z compiled: bool, 2025-05-07T20:31:49.5137460Z ) -> None: 2025-05-07T20:31:49.5137548Z torch.manual_seed(2025) 2025-05-07T20:31:49.5137616Z 2025-05-07T20:31:49.5137783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5137853Z 2025-05-07T20:31:49.5137946Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5138069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5138154Z x = x_sign * x_clamp 2025-05-07T20:31:49.5138226Z x0 = x[:, :D] 2025-05-07T20:31:49.5138305Z x1 = x[:, D:] 2025-05-07T20:31:49.5138377Z 2025-05-07T20:31:49.5138458Z if contiguous: 2025-05-07T20:31:49.5138545Z x0 = x0.contiguous() 2025-05-07T20:31:49.5138630Z x1 = x1.contiguous() 2025-05-07T20:31:49.5138701Z 2025-05-07T20:31:49.5138789Z if scale_ub is not None: 2025-05-07T20:31:49.5138891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5139021Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5139094Z ) 2025-05-07T20:31:49.5139166Z else: 2025-05-07T20:31:49.5139260Z scale_ub_tensor = None 2025-05-07T20:31:49.5139327Z 2025-05-07T20:31:49.5139557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5139647Z op = silu_mul_quant 2025-05-07T20:31:49.5139727Z if compiled: 2025-05-07T20:31:49.5139820Z op = torch.compile(op) 2025-05-07T20:31:49.5139925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5139993Z 2025-05-07T20:31:49.5140157Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5140161Z 2025-05-07T20:31:49.5140251Z moe/activation_test.py:117: 2025-05-07T20:31:49.5140375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5140474Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5140569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5141073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5141170Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5141531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5141753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5142086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5142172Z kernel = self.compile( 2025-05-07T20:31:49.5142555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5142726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5142852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5142857Z 2025-05-07T20:31:49.5143060Z self = 2025-05-07T20:31:49.5143893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5144399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02d1c820>} 2025-05-07T20:31:49.5145138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5145335Z context = 2025-05-07T20:31:49.5145340Z 2025-05-07T20:31:49.5145500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5145757Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5145866Z module_map=module_map) 2025-05-07T20:31:49.5146028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5146128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5146201Z E ^ 2025-05-07T20:31:49.5146554Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5146558Z 2025-05-07T20:31:49.5146974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5146979Z 2025-05-07T20:31:49.5147076Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5147297Z self=, 2025-05-07T20:31:49.5147370Z T=2048, 2025-05-07T20:31:49.5147442Z D=7168, 2025-05-07T20:31:49.5147526Z scale_ub=None, 2025-05-07T20:31:49.5147605Z contiguous=False, 2025-05-07T20:31:49.5147680Z compiled=True, 2025-05-07T20:31:49.5147755Z ) 2025-05-07T20:31:49.5148056Z self = 2025-05-07T20:31:49.5148227Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5148231Z 2025-05-07T20:31:49.5148305Z @given( 2025-05-07T20:31:49.5148419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5148513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5148699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5148811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5148924Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5148994Z ) 2025-05-07T20:31:49.5149236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5149327Z def test_silu_mul_quant( 2025-05-07T20:31:49.5149400Z self, 2025-05-07T20:31:49.5149469Z T: int, 2025-05-07T20:31:49.5149545Z D: int, 2025-05-07T20:31:49.5149639Z scale_ub: Optional[float], 2025-05-07T20:31:49.5149727Z contiguous: bool, 2025-05-07T20:31:49.5149811Z compiled: bool, 2025-05-07T20:31:49.5149945Z ) -> None: 2025-05-07T20:31:49.5150036Z torch.manual_seed(2025) 2025-05-07T20:31:49.5150109Z 2025-05-07T20:31:49.5150273Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5150356Z 2025-05-07T20:31:49.5150443Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5150561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5150650Z x = x_sign * x_clamp 2025-05-07T20:31:49.5150723Z x0 = x[:, :D] 2025-05-07T20:31:49.5150799Z x1 = x[:, D:] 2025-05-07T20:31:49.5150869Z 2025-05-07T20:31:49.5150947Z if contiguous: 2025-05-07T20:31:49.5151030Z x0 = x0.contiguous() 2025-05-07T20:31:49.5151118Z x1 = x1.contiguous() 2025-05-07T20:31:49.5151187Z 2025-05-07T20:31:49.5151273Z if scale_ub is not None: 2025-05-07T20:31:49.5151376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5151512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5151587Z ) 2025-05-07T20:31:49.5151661Z else: 2025-05-07T20:31:49.5151751Z scale_ub_tensor = None 2025-05-07T20:31:49.5151822Z 2025-05-07T20:31:49.5151946Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5152034Z op = silu_mul_quant 2025-05-07T20:31:49.5152117Z if compiled: 2025-05-07T20:31:49.5152211Z op = torch.compile(op) 2025-05-07T20:31:49.5152310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5152382Z 2025-05-07T20:31:49.5152467Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5152471Z 2025-05-07T20:31:49.5152566Z moe/activation_test.py:117: 2025-05-07T20:31:49.5152693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5152789Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5152894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5153264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5153352Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5153848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5153945Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5154296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5154518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5154851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5154942Z kernel = self.compile( 2025-05-07T20:31:49.5155398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5155570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5155695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5155700Z 2025-05-07T20:31:49.5155902Z self = 2025-05-07T20:31:49.5156754Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5157255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02acf790>} 2025-05-07T20:31:49.5158010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5158197Z context = 2025-05-07T20:31:49.5158202Z 2025-05-07T20:31:49.5158361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5158621Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5158726Z module_map=module_map) 2025-05-07T20:31:49.5158883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5158978Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5159054Z E ^ 2025-05-07T20:31:49.5159408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5159412Z 2025-05-07T20:31:49.5159821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5159830Z 2025-05-07T20:31:49.5159929Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5160149Z self=, 2025-05-07T20:31:49.5160222Z T=4096, 2025-05-07T20:31:49.5160296Z D=7168, 2025-05-07T20:31:49.5160379Z scale_ub=None, 2025-05-07T20:31:49.5160463Z contiguous=False, 2025-05-07T20:31:49.5160544Z compiled=True, 2025-05-07T20:31:49.5160615Z ) 2025-05-07T20:31:49.5160835Z self = 2025-05-07T20:31:49.5161004Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5161009Z 2025-05-07T20:31:49.5161077Z @given( 2025-05-07T20:31:49.5161190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5161285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5161395Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5161511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5161623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5161694Z ) 2025-05-07T20:31:49.5161939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5162028Z def test_silu_mul_quant( 2025-05-07T20:31:49.5162102Z self, 2025-05-07T20:31:49.5162182Z T: int, 2025-05-07T20:31:49.5162252Z D: int, 2025-05-07T20:31:49.5162347Z scale_ub: Optional[float], 2025-05-07T20:31:49.5162434Z contiguous: bool, 2025-05-07T20:31:49.5162513Z compiled: bool, 2025-05-07T20:31:49.5162587Z ) -> None: 2025-05-07T20:31:49.5162682Z torch.manual_seed(2025) 2025-05-07T20:31:49.5162751Z 2025-05-07T20:31:49.5162918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5162988Z 2025-05-07T20:31:49.5163098Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5163233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5163410Z x = x_sign * x_clamp 2025-05-07T20:31:49.5163485Z x0 = x[:, :D] 2025-05-07T20:31:49.5163564Z x1 = x[:, D:] 2025-05-07T20:31:49.5163630Z 2025-05-07T20:31:49.5163707Z if contiguous: 2025-05-07T20:31:49.5163798Z x0 = x0.contiguous() 2025-05-07T20:31:49.5163955Z x1 = x1.contiguous() 2025-05-07T20:31:49.5164024Z 2025-05-07T20:31:49.5164114Z if scale_ub is not None: 2025-05-07T20:31:49.5164218Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5164350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5164427Z ) 2025-05-07T20:31:49.5164500Z else: 2025-05-07T20:31:49.5164592Z scale_ub_tensor = None 2025-05-07T20:31:49.5164661Z 2025-05-07T20:31:49.5164785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5164873Z op = silu_mul_quant 2025-05-07T20:31:49.5164953Z if compiled: 2025-05-07T20:31:49.5165051Z op = torch.compile(op) 2025-05-07T20:31:49.5165156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5165224Z 2025-05-07T20:31:49.5165311Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5165315Z 2025-05-07T20:31:49.5165411Z moe/activation_test.py:117: 2025-05-07T20:31:49.5165538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5165636Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5165730Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5166095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5166185Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5166677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5166769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5167128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5167347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5167685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5167779Z kernel = self.compile( 2025-05-07T20:31:49.5168154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5168327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5168447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5168452Z 2025-05-07T20:31:49.5168654Z self = 2025-05-07T20:31:49.5169434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5169935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa028114c0>} 2025-05-07T20:31:49.5170684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5170871Z context = 2025-05-07T20:31:49.5170876Z 2025-05-07T20:31:49.5171040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5171299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5171400Z module_map=module_map) 2025-05-07T20:31:49.5171670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5171767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5171839Z E ^ 2025-05-07T20:31:49.5172193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5172272Z 2025-05-07T20:31:49.5172681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5172686Z 2025-05-07T20:31:49.5172788Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5173004Z self=, 2025-05-07T20:31:49.5173078Z T=16384, 2025-05-07T20:31:49.5173157Z D=5120, 2025-05-07T20:31:49.5173235Z scale_ub=1200.0, 2025-05-07T20:31:49.5173334Z contiguous=False, 2025-05-07T20:31:49.5173423Z compiled=False, 2025-05-07T20:31:49.5173504Z ) 2025-05-07T20:31:49.5173738Z self = 2025-05-07T20:31:49.5173914Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5173919Z 2025-05-07T20:31:49.5173991Z @given( 2025-05-07T20:31:49.5174106Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5174203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5174311Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5174426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5174536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5174606Z ) 2025-05-07T20:31:49.5174849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5174936Z def test_silu_mul_quant( 2025-05-07T20:31:49.5175012Z self, 2025-05-07T20:31:49.5175082Z T: int, 2025-05-07T20:31:49.5175152Z D: int, 2025-05-07T20:31:49.5175257Z scale_ub: Optional[float], 2025-05-07T20:31:49.5175342Z contiguous: bool, 2025-05-07T20:31:49.5175422Z compiled: bool, 2025-05-07T20:31:49.5175497Z ) -> None: 2025-05-07T20:31:49.5175586Z torch.manual_seed(2025) 2025-05-07T20:31:49.5175654Z 2025-05-07T20:31:49.5175821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5175897Z 2025-05-07T20:31:49.5175985Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5176104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5176188Z x = x_sign * x_clamp 2025-05-07T20:31:49.5176262Z x0 = x[:, :D] 2025-05-07T20:31:49.5176338Z x1 = x[:, D:] 2025-05-07T20:31:49.5176406Z 2025-05-07T20:31:49.5176487Z if contiguous: 2025-05-07T20:31:49.5176572Z x0 = x0.contiguous() 2025-05-07T20:31:49.5176656Z x1 = x1.contiguous() 2025-05-07T20:31:49.5176727Z 2025-05-07T20:31:49.5176813Z if scale_ub is not None: 2025-05-07T20:31:49.5176918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5177051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5177124Z ) 2025-05-07T20:31:49.5177194Z else: 2025-05-07T20:31:49.5177286Z scale_ub_tensor = None 2025-05-07T20:31:49.5177361Z 2025-05-07T20:31:49.5177489Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5177575Z op = silu_mul_quant 2025-05-07T20:31:49.5177653Z if compiled: 2025-05-07T20:31:49.5177753Z op = torch.compile(op) 2025-05-07T20:31:49.5177853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5177922Z 2025-05-07T20:31:49.5178009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5178013Z 2025-05-07T20:31:49.5178105Z moe/activation_test.py:117: 2025-05-07T20:31:49.5178227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5178326Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5178500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5179003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:49.5179096Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5180142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5180364Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5180698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5180787Z kernel = self.compile( 2025-05-07T20:31:49.5181166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5181335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5181466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5181471Z 2025-05-07T20:31:49.5181671Z self = 2025-05-07T20:31:49.5182445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5182954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02811820>} 2025-05-07T20:31:49.5183693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5183883Z context = 2025-05-07T20:31:49.5183892Z 2025-05-07T20:31:49.5184053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5184315Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5184420Z module_map=module_map) 2025-05-07T20:31:49.5184585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5184679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5184752Z E ^ 2025-05-07T20:31:49.5185108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5185113Z 2025-05-07T20:31:49.5185526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5185530Z 2025-05-07T20:31:49.5185632Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5185854Z self=, 2025-05-07T20:31:49.5185927Z T=16384, 2025-05-07T20:31:49.5185999Z D=5120, 2025-05-07T20:31:49.5186078Z scale_ub=1200.0, 2025-05-07T20:31:49.5186156Z contiguous=True, 2025-05-07T20:31:49.5186238Z compiled=True, 2025-05-07T20:31:49.5186307Z ) 2025-05-07T20:31:49.5186525Z self = 2025-05-07T20:31:49.5186699Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5186704Z 2025-05-07T20:31:49.5186778Z @given( 2025-05-07T20:31:49.5186896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5186990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5187101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5187214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5187324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5187396Z ) 2025-05-07T20:31:49.5187734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5187827Z def test_silu_mul_quant( 2025-05-07T20:31:49.5187903Z self, 2025-05-07T20:31:49.5187972Z T: int, 2025-05-07T20:31:49.5188043Z D: int, 2025-05-07T20:31:49.5188141Z scale_ub: Optional[float], 2025-05-07T20:31:49.5188299Z contiguous: bool, 2025-05-07T20:31:49.5188381Z compiled: bool, 2025-05-07T20:31:49.5188458Z ) -> None: 2025-05-07T20:31:49.5188547Z torch.manual_seed(2025) 2025-05-07T20:31:49.5188616Z 2025-05-07T20:31:49.5188781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5188850Z 2025-05-07T20:31:49.5188937Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5189058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5189140Z x = x_sign * x_clamp 2025-05-07T20:31:49.5189220Z x0 = x[:, :D] 2025-05-07T20:31:49.5189300Z x1 = x[:, D:] 2025-05-07T20:31:49.5189367Z 2025-05-07T20:31:49.5189448Z if contiguous: 2025-05-07T20:31:49.5189535Z x0 = x0.contiguous() 2025-05-07T20:31:49.5189621Z x1 = x1.contiguous() 2025-05-07T20:31:49.5189693Z 2025-05-07T20:31:49.5189779Z if scale_ub is not None: 2025-05-07T20:31:49.5189958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5190093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5190166Z ) 2025-05-07T20:31:49.5190241Z else: 2025-05-07T20:31:49.5190336Z scale_ub_tensor = None 2025-05-07T20:31:49.5190402Z 2025-05-07T20:31:49.5190529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5190619Z op = silu_mul_quant 2025-05-07T20:31:49.5190699Z if compiled: 2025-05-07T20:31:49.5190799Z op = torch.compile(op) 2025-05-07T20:31:49.5190900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5190970Z 2025-05-07T20:31:49.5191058Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5191063Z 2025-05-07T20:31:49.5191155Z moe/activation_test.py:117: 2025-05-07T20:31:49.5191279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5191378Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5191478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5191841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5191930Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5192423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5192519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5192873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5193122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5193484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5193574Z kernel = self.compile( 2025-05-07T20:31:49.5193954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5194127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5194248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5194253Z 2025-05-07T20:31:49.5194458Z self = 2025-05-07T20:31:49.5195234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5195824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02a6be50>} 2025-05-07T20:31:49.5196575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5196863Z context = 2025-05-07T20:31:49.5196868Z 2025-05-07T20:31:49.5197032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5197294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5197401Z module_map=module_map) 2025-05-07T20:31:49.5197562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5197667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5197742Z E ^ 2025-05-07T20:31:49.5198093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5198098Z 2025-05-07T20:31:49.5198510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5198519Z 2025-05-07T20:31:49.5198618Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5198836Z self=, 2025-05-07T20:31:49.5198913Z T=16384, 2025-05-07T20:31:49.5198983Z D=5120, 2025-05-07T20:31:49.5199059Z scale_ub=None, 2025-05-07T20:31:49.5199145Z contiguous=False, 2025-05-07T20:31:49.5199223Z compiled=True, 2025-05-07T20:31:49.5199294Z ) 2025-05-07T20:31:49.5199510Z self = 2025-05-07T20:31:49.5199687Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5199692Z 2025-05-07T20:31:49.5199769Z @given( 2025-05-07T20:31:49.5199883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5199977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5200092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5200208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5200316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5200390Z ) 2025-05-07T20:31:49.5200628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5200716Z def test_silu_mul_quant( 2025-05-07T20:31:49.5200793Z self, 2025-05-07T20:31:49.5200863Z T: int, 2025-05-07T20:31:49.5200938Z D: int, 2025-05-07T20:31:49.5201031Z scale_ub: Optional[float], 2025-05-07T20:31:49.5201113Z contiguous: bool, 2025-05-07T20:31:49.5201194Z compiled: bool, 2025-05-07T20:31:49.5201270Z ) -> None: 2025-05-07T20:31:49.5201360Z torch.manual_seed(2025) 2025-05-07T20:31:49.5201433Z 2025-05-07T20:31:49.5201597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5201666Z 2025-05-07T20:31:49.5201753Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5201876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5201960Z x = x_sign * x_clamp 2025-05-07T20:31:49.5202039Z x0 = x[:, :D] 2025-05-07T20:31:49.5202114Z x1 = x[:, D:] 2025-05-07T20:31:49.5202183Z 2025-05-07T20:31:49.5202263Z if contiguous: 2025-05-07T20:31:49.5202350Z x0 = x0.contiguous() 2025-05-07T20:31:49.5202439Z x1 = x1.contiguous() 2025-05-07T20:31:49.5202507Z 2025-05-07T20:31:49.5202595Z if scale_ub is not None: 2025-05-07T20:31:49.5202698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5202826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5202984Z ) 2025-05-07T20:31:49.5203067Z else: 2025-05-07T20:31:49.5203161Z scale_ub_tensor = None 2025-05-07T20:31:49.5203247Z 2025-05-07T20:31:49.5203390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5203486Z op = silu_mul_quant 2025-05-07T20:31:49.5203640Z if compiled: 2025-05-07T20:31:49.5203913Z op = torch.compile(op) 2025-05-07T20:31:49.5204018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5204091Z 2025-05-07T20:31:49.5204176Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5204180Z 2025-05-07T20:31:49.5204273Z moe/activation_test.py:117: 2025-05-07T20:31:49.5204402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5204498Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5204593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5204967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5205055Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5205549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5205646Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5205996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5206221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5206551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5206641Z kernel = self.compile( 2025-05-07T20:31:49.5207018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5207191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5207317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5207321Z 2025-05-07T20:31:49.5207524Z self = 2025-05-07T20:31:49.5208298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5208809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02ae09d0>} 2025-05-07T20:31:49.5209550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5209747Z context = 2025-05-07T20:31:49.5209751Z 2025-05-07T20:31:49.5209915Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5210177Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5210286Z module_map=module_map) 2025-05-07T20:31:49.5210444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5210543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5210616Z E ^ 2025-05-07T20:31:49.5210967Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5210972Z 2025-05-07T20:31:49.5211384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5211388Z 2025-05-07T20:31:49.5211487Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5211919Z self=, 2025-05-07T20:31:49.5211996Z T=2048, 2025-05-07T20:31:49.5212069Z D=5120, 2025-05-07T20:31:49.5212149Z scale_ub=None, 2025-05-07T20:31:49.5212231Z contiguous=False, 2025-05-07T20:31:49.5212306Z compiled=True, 2025-05-07T20:31:49.5212500Z ) 2025-05-07T20:31:49.5212714Z self = 2025-05-07T20:31:49.5212880Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5212884Z 2025-05-07T20:31:49.5212961Z @given( 2025-05-07T20:31:49.5213075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5213176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5213285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5213397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5213510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5213586Z ) 2025-05-07T20:31:49.5213826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5213919Z def test_silu_mul_quant( 2025-05-07T20:31:49.5213991Z self, 2025-05-07T20:31:49.5214066Z T: int, 2025-05-07T20:31:49.5214144Z D: int, 2025-05-07T20:31:49.5214240Z scale_ub: Optional[float], 2025-05-07T20:31:49.5214327Z contiguous: bool, 2025-05-07T20:31:49.5214411Z compiled: bool, 2025-05-07T20:31:49.5214487Z ) -> None: 2025-05-07T20:31:49.5214578Z torch.manual_seed(2025) 2025-05-07T20:31:49.5214647Z 2025-05-07T20:31:49.5214810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5214882Z 2025-05-07T20:31:49.5214970Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5215087Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5215177Z x = x_sign * x_clamp 2025-05-07T20:31:49.5215256Z x0 = x[:, :D] 2025-05-07T20:31:49.5215330Z x1 = x[:, D:] 2025-05-07T20:31:49.5215398Z 2025-05-07T20:31:49.5215476Z if contiguous: 2025-05-07T20:31:49.5215561Z x0 = x0.contiguous() 2025-05-07T20:31:49.5215651Z x1 = x1.contiguous() 2025-05-07T20:31:49.5215719Z 2025-05-07T20:31:49.5215816Z if scale_ub is not None: 2025-05-07T20:31:49.5215915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5216046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5216123Z ) 2025-05-07T20:31:49.5216193Z else: 2025-05-07T20:31:49.5216284Z scale_ub_tensor = None 2025-05-07T20:31:49.5216356Z 2025-05-07T20:31:49.5216484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5216567Z op = silu_mul_quant 2025-05-07T20:31:49.5216652Z if compiled: 2025-05-07T20:31:49.5216748Z op = torch.compile(op) 2025-05-07T20:31:49.5216853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5216924Z 2025-05-07T20:31:49.5217009Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5217013Z 2025-05-07T20:31:49.5217105Z moe/activation_test.py:117: 2025-05-07T20:31:49.5217227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5217326Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5217426Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5217790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5217876Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5218367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5218460Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5218898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5219120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5219454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5219552Z kernel = self.compile( 2025-05-07T20:31:49.5220002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5220172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5220297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5220302Z 2025-05-07T20:31:49.5220503Z self = 2025-05-07T20:31:49.5221288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5221791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa0270a550>} 2025-05-07T20:31:49.5222537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5222733Z context = 2025-05-07T20:31:49.5222738Z 2025-05-07T20:31:49.5222901Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5223207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5223319Z module_map=module_map) 2025-05-07T20:31:49.5223481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5223581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5223655Z E ^ 2025-05-07T20:31:49.5224009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5224014Z 2025-05-07T20:31:49.5224421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5224431Z 2025-05-07T20:31:49.5224538Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5224755Z self=, 2025-05-07T20:31:49.5224827Z T=2048, 2025-05-07T20:31:49.5224900Z D=5120, 2025-05-07T20:31:49.5224976Z scale_ub=1200.0, 2025-05-07T20:31:49.5225060Z contiguous=False, 2025-05-07T20:31:49.5225139Z compiled=True, 2025-05-07T20:31:49.5225207Z ) 2025-05-07T20:31:49.5225424Z self = 2025-05-07T20:31:49.5225599Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:49.5225604Z 2025-05-07T20:31:49.5230187Z @given( 2025-05-07T20:31:49.5230322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5230421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5230545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5230655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5230764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5230839Z ) 2025-05-07T20:31:49.5231086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5231178Z def test_silu_mul_quant( 2025-05-07T20:31:49.5231255Z self, 2025-05-07T20:31:49.5231326Z T: int, 2025-05-07T20:31:49.5231397Z D: int, 2025-05-07T20:31:49.5231493Z scale_ub: Optional[float], 2025-05-07T20:31:49.5231577Z contiguous: bool, 2025-05-07T20:31:49.5231765Z compiled: bool, 2025-05-07T20:31:49.5231843Z ) -> None: 2025-05-07T20:31:49.5231934Z torch.manual_seed(2025) 2025-05-07T20:31:49.5232009Z 2025-05-07T20:31:49.5232181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5232252Z 2025-05-07T20:31:49.5232445Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5232567Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5232649Z x = x_sign * x_clamp 2025-05-07T20:31:49.5232725Z x0 = x[:, :D] 2025-05-07T20:31:49.5232798Z x1 = x[:, D:] 2025-05-07T20:31:49.5232867Z 2025-05-07T20:31:49.5232948Z if contiguous: 2025-05-07T20:31:49.5233035Z x0 = x0.contiguous() 2025-05-07T20:31:49.5233128Z x1 = x1.contiguous() 2025-05-07T20:31:49.5233195Z 2025-05-07T20:31:49.5233301Z if scale_ub is not None: 2025-05-07T20:31:49.5233412Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5233572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5233645Z ) 2025-05-07T20:31:49.5233718Z else: 2025-05-07T20:31:49.5233810Z scale_ub_tensor = None 2025-05-07T20:31:49.5233878Z 2025-05-07T20:31:49.5234006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5234096Z op = silu_mul_quant 2025-05-07T20:31:49.5234176Z if compiled: 2025-05-07T20:31:49.5234276Z op = torch.compile(op) 2025-05-07T20:31:49.5234377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5234449Z 2025-05-07T20:31:49.5234537Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5234542Z 2025-05-07T20:31:49.5234638Z moe/activation_test.py:117: 2025-05-07T20:31:49.5234772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5234869Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5234965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5235346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5235435Z return fn(*args, **kwargs) 
2025-05-07T20:31:49.5235927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5236032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5236385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5236607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5236942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5237033Z kernel = self.compile( 2025-05-07T20:31:49.5237411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5237594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5237720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5237725Z 2025-05-07T20:31:49.5237928Z self = 2025-05-07T20:31:49.5238711Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5239220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02547310>} 2025-05-07T20:31:49.5240042Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5240237Z context = 2025-05-07T20:31:49.5240242Z 2025-05-07T20:31:49.5240407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5240665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5240847Z module_map=module_map) 2025-05-07T20:31:49.5241009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5241105Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5241178Z E ^ 2025-05-07T20:31:49.5241529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

The identical CompilationError (same test body, same traceback through triton/compiler/compiler.py:100) was then raised for each of the following examples:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
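Every CompilationError above is the same root failure: Triton only lowers the fp8e4nv (e4m3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner is compute capability 8.6, where only fp8e4b15 and fp8e5 are available. A minimal guard along these lines (a sketch; supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite) would skip the test cleanly on such runners instead of failing every example:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv only for compute capability >= 8.9
        # (Ada/Hopper); the A10G in this log reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...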
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
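The requested allocation sizes line up exactly with the test's tensor shapes: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.abs / torch.clamp each materialize a temporary of the same size. A quick check (a sketch, not from the test suite):

    # [T, 2*D] bfloat16 tensor, 2 bytes per element
    T, D = 16384, 5120
    print(T * 2 * D * 2 / 2**20)   # 320.0 -> the 320.00 MiB requested at activation_test.py:95 above
    T, D = 16384, 7168
    print(T * 2 * D * 2 / 2**20)   # 448.0 -> the 448.00 MiB requested at activation_test.py:92 below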
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The same torch.OutOfMemoryError was then raised for the following examples (failing line and requested allocation shown after each):

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> activation_test.py:95, 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> activation_test.py:92, 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> activation_test.py:95, 56.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> activation_test.py:94, 56.00 MiB
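These out-of-memory failures accumulate across Hypothesis examples: each example allocates fresh tensors of up to several hundred MiB on the 22 GiB A10G while earlier examples' allocations are still cached by PyTorch. The error message's own suggestion (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) targets fragmentation; releasing cached blocks between examples is another common mitigation. A sketch under those assumptions, not the test suite's actual teardown:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a cleaner
        # allocator state.
        gc.collect()
        torch.cuda.empty_cache()

    # Hypothetical usage: call from the TestCase's tearDown(), or wrap each
    # example body in try/finally with free_cuda_between_examples().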
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5375048Z 2025-05-07T20:31:49.5375160Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5375165Z 2025-05-07T20:31:49.5375264Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5375487Z self=, 2025-05-07T20:31:49.5375565Z T=1, 2025-05-07T20:31:49.5375641Z D=7168, 2025-05-07T20:31:49.5375716Z scale_ub=1200.0, 2025-05-07T20:31:49.5375794Z contiguous=True, 2025-05-07T20:31:49.5375872Z compiled=False, 2025-05-07T20:31:49.5375941Z ) 2025-05-07T20:31:49.5376150Z self = 2025-05-07T20:31:49.5376313Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5376317Z 2025-05-07T20:31:49.5376391Z @given( 2025-05-07T20:31:49.5376503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5376599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5376711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5376824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5376933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5377002Z ) 2025-05-07T20:31:49.5377246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5377343Z def test_silu_mul_quant( 2025-05-07T20:31:49.5377416Z self, 2025-05-07T20:31:49.5377491Z T: int, 2025-05-07T20:31:49.5377560Z D: int, 2025-05-07T20:31:49.5377653Z scale_ub: Optional[float], 2025-05-07T20:31:49.5377739Z contiguous: bool, 2025-05-07T20:31:49.5377819Z compiled: bool, 2025-05-07T20:31:49.5377890Z ) -> None: 2025-05-07T20:31:49.5377982Z torch.manual_seed(2025) 2025-05-07T20:31:49.5378048Z 2025-05-07T20:31:49.5378212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5378281Z 2025-05-07T20:31:49.5378449Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5378572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5378656Z x = x_sign * x_clamp 2025-05-07T20:31:49.5378730Z x0 = x[:, :D] 2025-05-07T20:31:49.5378809Z x1 = x[:, D:] 2025-05-07T20:31:49.5378877Z 2025-05-07T20:31:49.5379031Z if contiguous: 2025-05-07T20:31:49.5379120Z x0 = x0.contiguous() 2025-05-07T20:31:49.5379204Z x1 = x1.contiguous() 2025-05-07T20:31:49.5379272Z 2025-05-07T20:31:49.5379362Z if scale_ub is not None: 2025-05-07T20:31:49.5379465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5379600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5379673Z ) 2025-05-07T20:31:49.5379743Z else: 2025-05-07T20:31:49.5379834Z scale_ub_tensor = None 2025-05-07T20:31:49.5379899Z 2025-05-07T20:31:49.5380023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5380121Z op = silu_mul_quant 2025-05-07T20:31:49.5380202Z if compiled: 2025-05-07T20:31:49.5380296Z op = torch.compile(op) 2025-05-07T20:31:49.5380400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5380473Z 2025-05-07T20:31:49.5380559Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5380569Z 2025-05-07T20:31:49.5380665Z moe/activation_test.py:117: 2025-05-07T20:31:49.5380789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5380889Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5380983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5381480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5381577Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5381937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5382156Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5382493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5382582Z kernel = self.compile( 2025-05-07T20:31:49.5382964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5383135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5383258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5383262Z 2025-05-07T20:31:49.5383471Z self = 2025-05-07T20:31:49.5384301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5384808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021dc550>} 2025-05-07T20:31:49.5385549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5385743Z context = 2025-05-07T20:31:49.5385750Z 2025-05-07T20:31:49.5385912Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5386171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5386277Z module_map=module_map) 2025-05-07T20:31:49.5386436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5386636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5386714Z E ^ 2025-05-07T20:31:49.5387065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5387070Z 2025-05-07T20:31:49.5387480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5387564Z 2025-05-07T20:31:49.5387664Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5387886Z self=, 2025-05-07T20:31:49.5387961Z T=128, 2025-05-07T20:31:49.5388034Z D=5120, 2025-05-07T20:31:49.5388110Z scale_ub=None, 2025-05-07T20:31:49.5388192Z contiguous=True, 2025-05-07T20:31:49.5388274Z compiled=False, 2025-05-07T20:31:49.5388343Z ) 2025-05-07T20:31:49.5388560Z self = 2025-05-07T20:31:49.5388729Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5388733Z 2025-05-07T20:31:49.5388806Z @given( 2025-05-07T20:31:49.5388922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5389015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5389133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5389244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5389351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5389421Z ) 2025-05-07T20:31:49.5389659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5389753Z def test_silu_mul_quant( 2025-05-07T20:31:49.5389883Z self, 2025-05-07T20:31:49.5389954Z T: int, 2025-05-07T20:31:49.5390029Z D: int, 2025-05-07T20:31:49.5390125Z scale_ub: Optional[float], 2025-05-07T20:31:49.5390211Z contiguous: bool, 2025-05-07T20:31:49.5390297Z compiled: bool, 2025-05-07T20:31:49.5390371Z ) -> None: 2025-05-07T20:31:49.5390459Z torch.manual_seed(2025) 2025-05-07T20:31:49.5390534Z 2025-05-07T20:31:49.5390698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5390768Z 2025-05-07T20:31:49.5390863Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5390981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5391063Z x = x_sign * x_clamp 2025-05-07T20:31:49.5391141Z x0 = x[:, :D] 2025-05-07T20:31:49.5391214Z x1 = x[:, D:] 2025-05-07T20:31:49.5391288Z 2025-05-07T20:31:49.5391367Z if contiguous: 2025-05-07T20:31:49.5391451Z x0 = x0.contiguous() 2025-05-07T20:31:49.5391538Z x1 = x1.contiguous() 2025-05-07T20:31:49.5391604Z 2025-05-07T20:31:49.5391691Z if scale_ub is not None: 2025-05-07T20:31:49.5391797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5391933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5392005Z ) 2025-05-07T20:31:49.5392079Z else: 2025-05-07T20:31:49.5392168Z scale_ub_tensor = None 2025-05-07T20:31:49.5392238Z 2025-05-07T20:31:49.5392365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5392455Z op = silu_mul_quant 2025-05-07T20:31:49.5392540Z if compiled: 2025-05-07T20:31:49.5392637Z op = torch.compile(op) 2025-05-07T20:31:49.5392738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5392812Z 2025-05-07T20:31:49.5392898Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5392902Z 2025-05-07T20:31:49.5392993Z moe/activation_test.py:117: 2025-05-07T20:31:49.5393119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5393215Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5393309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5393892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5393990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5394347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5394643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5394980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5395073Z kernel = self.compile( 2025-05-07T20:31:49.5395451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5395626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5395748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5395757Z 2025-05-07T20:31:49.5395958Z self = 2025-05-07T20:31:49.5396735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5397246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227040>} 2025-05-07T20:31:49.5397992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5398182Z context = 2025-05-07T20:31:49.5398187Z 2025-05-07T20:31:49.5398352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5398614Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5398719Z module_map=module_map) 2025-05-07T20:31:49.5398880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5398980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5399055Z E ^ 2025-05-07T20:31:49.5399408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5399413Z 2025-05-07T20:31:49.5399821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5399826Z 2025-05-07T20:31:49.5399928Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5400145Z self=, 2025-05-07T20:31:49.5400219Z T=128, 2025-05-07T20:31:49.5400299Z D=7168, 2025-05-07T20:31:49.5400375Z scale_ub=None, 2025-05-07T20:31:49.5400453Z contiguous=True, 2025-05-07T20:31:49.5400533Z compiled=False, 2025-05-07T20:31:49.5400600Z ) 2025-05-07T20:31:49.5400812Z self = 2025-05-07T20:31:49.5400982Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5400987Z 2025-05-07T20:31:49.5401059Z @given( 2025-05-07T20:31:49.5401172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5401271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5401383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5401496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5401605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5401675Z ) 2025-05-07T20:31:49.5401915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5402085Z def test_silu_mul_quant( 2025-05-07T20:31:49.5402161Z self, 2025-05-07T20:31:49.5402236Z T: int, 2025-05-07T20:31:49.5402308Z D: int, 2025-05-07T20:31:49.5402401Z scale_ub: Optional[float], 2025-05-07T20:31:49.5402486Z contiguous: bool, 2025-05-07T20:31:49.5402643Z compiled: bool, 2025-05-07T20:31:49.5402721Z ) -> None: 2025-05-07T20:31:49.5402812Z torch.manual_seed(2025) 2025-05-07T20:31:49.5402880Z 2025-05-07T20:31:49.5403065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5403146Z 2025-05-07T20:31:49.5403247Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5403377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5403460Z x = x_sign * x_clamp 2025-05-07T20:31:49.5403533Z x0 = x[:, :D] 2025-05-07T20:31:49.5403610Z x1 = x[:, D:] 2025-05-07T20:31:49.5403679Z 2025-05-07T20:31:49.5403989Z if contiguous: 2025-05-07T20:31:49.5404081Z x0 = x0.contiguous() 2025-05-07T20:31:49.5404165Z x1 = x1.contiguous() 2025-05-07T20:31:49.5404235Z 2025-05-07T20:31:49.5404322Z if scale_ub is not None: 2025-05-07T20:31:49.5404425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5404566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5404638Z ) 2025-05-07T20:31:49.5404706Z else: 2025-05-07T20:31:49.5404797Z scale_ub_tensor = None 2025-05-07T20:31:49.5404866Z 2025-05-07T20:31:49.5404991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5405081Z op = silu_mul_quant 2025-05-07T20:31:49.5405160Z if compiled: 2025-05-07T20:31:49.5405255Z op = torch.compile(op) 2025-05-07T20:31:49.5405359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5405430Z 2025-05-07T20:31:49.5405514Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5405525Z 2025-05-07T20:31:49.5405618Z moe/activation_test.py:117: 2025-05-07T20:31:49.5405741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5405843Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5405937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5406435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5406533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5406886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5407108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5407442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5407529Z kernel = self.compile( 2025-05-07T20:31:49.5407913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5408082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5408204Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5408216Z 2025-05-07T20:31:49.5408418Z self = 2025-05-07T20:31:49.5409191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5409697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa02227c10>} 2025-05-07T20:31:49.5410572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5410767Z context = 2025-05-07T20:31:49.5410772Z 2025-05-07T20:31:49.5410934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5411304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5411411Z module_map=module_map) 2025-05-07T20:31:49.5411574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5411670Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5411747Z E ^ 2025-05-07T20:31:49.5412097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5412102Z 2025-05-07T20:31:49.5412520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5412524Z 2025-05-07T20:31:49.5412625Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5412841Z self=, 2025-05-07T20:31:49.5412919Z T=2048, 2025-05-07T20:31:49.5412994Z D=7168, 2025-05-07T20:31:49.5413074Z scale_ub=1200.0, 2025-05-07T20:31:49.5413156Z contiguous=True, 2025-05-07T20:31:49.5413233Z compiled=False, 2025-05-07T20:31:49.5413307Z ) 2025-05-07T20:31:49.5413555Z self = 2025-05-07T20:31:49.5413736Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5413741Z 2025-05-07T20:31:49.5413818Z @given( 2025-05-07T20:31:49.5413935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5414028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5414149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5414261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5414370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5414440Z ) 2025-05-07T20:31:49.5414679Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5414772Z def test_silu_mul_quant( 2025-05-07T20:31:49.5414844Z self, 2025-05-07T20:31:49.5414913Z T: int, 2025-05-07T20:31:49.5414985Z D: int, 2025-05-07T20:31:49.5415079Z scale_ub: Optional[float], 2025-05-07T20:31:49.5415163Z contiguous: bool, 2025-05-07T20:31:49.5415244Z compiled: bool, 2025-05-07T20:31:49.5415318Z ) -> None: 2025-05-07T20:31:49.5415407Z torch.manual_seed(2025) 2025-05-07T20:31:49.5415476Z 2025-05-07T20:31:49.5415640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5417439Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
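Annotation: the CompilationError above is a different failure mode from the OOMs: Triton rejects the kernel's fp8e4nv (e4m3) element type on this GPU architecture before any memory is touched. A hedged sketch of gating such tests on device capability follows; treating compute capability (8, 9) as the fp8e4nv floor is an assumption inferred from the error message, and the class name is illustrative.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv needs SM 8.9+ (Ada/Hopper-class parts);
    # older architectures only offer fp8e4b15/fp8e5, matching the
    # ValueError in the trace above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantFp8Test(unittest.TestCase):  # illustrative name
    pass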
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5417449Z 2025-05-07T20:31:49.5417563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5417567Z 2025-05-07T20:31:49.5417668Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5417890Z self=, 2025-05-07T20:31:49.5417964Z T=1, 2025-05-07T20:31:49.5418039Z D=5120, 2025-05-07T20:31:49.5418116Z scale_ub=1200.0, 2025-05-07T20:31:49.5418192Z contiguous=True, 2025-05-07T20:31:49.5418381Z compiled=False, 2025-05-07T20:31:49.5418452Z ) 2025-05-07T20:31:49.5418669Z self = 2025-05-07T20:31:49.5418830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5418834Z 2025-05-07T20:31:49.5418983Z @given( 2025-05-07T20:31:49.5419099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5419194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5419303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5419417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5419527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5419595Z ) 2025-05-07T20:31:49.5419837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5419927Z def test_silu_mul_quant( 2025-05-07T20:31:49.5420000Z self, 2025-05-07T20:31:49.5420082Z T: int, 2025-05-07T20:31:49.5420155Z D: int, 2025-05-07T20:31:49.5420251Z scale_ub: Optional[float], 2025-05-07T20:31:49.5420339Z contiguous: bool, 2025-05-07T20:31:49.5420418Z compiled: bool, 2025-05-07T20:31:49.5420492Z ) -> None: 2025-05-07T20:31:49.5420579Z torch.manual_seed(2025) 2025-05-07T20:31:49.5420654Z 2025-05-07T20:31:49.5420818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5420889Z 2025-05-07T20:31:49.5420976Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5421097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5421180Z x = x_sign * x_clamp 2025-05-07T20:31:49.5421256Z x0 = x[:, :D] 2025-05-07T20:31:49.5421336Z x1 = x[:, D:] 2025-05-07T20:31:49.5421404Z 2025-05-07T20:31:49.5421481Z if contiguous: 2025-05-07T20:31:49.5421572Z x0 = x0.contiguous() 2025-05-07T20:31:49.5421656Z x1 = x1.contiguous() 2025-05-07T20:31:49.5421733Z 2025-05-07T20:31:49.5421819Z if scale_ub is not None: 2025-05-07T20:31:49.5421920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5422054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5422126Z ) 2025-05-07T20:31:49.5422202Z else: 2025-05-07T20:31:49.5422295Z scale_ub_tensor = None 2025-05-07T20:31:49.5422362Z 2025-05-07T20:31:49.5422487Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5422575Z op = silu_mul_quant 2025-05-07T20:31:49.5422654Z if compiled: 2025-05-07T20:31:49.5422751Z op = torch.compile(op) 2025-05-07T20:31:49.5422855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5422925Z 2025-05-07T20:31:49.5423010Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5423019Z 2025-05-07T20:31:49.5423109Z moe/activation_test.py:117: 2025-05-07T20:31:49.5423236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5423336Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5423430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5423924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5424024Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5424378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5424598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5424934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5425024Z kernel = self.compile( 2025-05-07T20:31:49.5425402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5425655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5425778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5425783Z 2025-05-07T20:31:49.5425989Z self = 2025-05-07T20:31:49.5426842Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5427344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faa021ad9d0>} 2025-05-07T20:31:49.5428095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5428289Z context = 2025-05-07T20:31:49.5428294Z 2025-05-07T20:31:49.5428456Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5428723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5428836Z module_map=module_map) 2025-05-07T20:31:49.5429000Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5429093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5429165Z E ^ 2025-05-07T20:31:49.5429521Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5429526Z 2025-05-07T20:31:49.5430007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5430013Z 2025-05-07T20:31:49.5430116Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5430337Z self=, 2025-05-07T20:31:49.5430410Z T=2048, 2025-05-07T20:31:49.5430487Z D=5120, 2025-05-07T20:31:49.5430563Z scale_ub=None, 2025-05-07T20:31:49.5430641Z contiguous=True, 2025-05-07T20:31:49.5430726Z compiled=False, 2025-05-07T20:31:49.5430794Z ) 2025-05-07T20:31:49.5431009Z self = 2025-05-07T20:31:49.5431182Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5431186Z 2025-05-07T20:31:49.5431262Z @given( 2025-05-07T20:31:49.5431375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5431472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5431584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5431698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5431813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5431882Z ) 2025-05-07T20:31:49.5432127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5432216Z def test_silu_mul_quant( 2025-05-07T20:31:49.5432289Z self, 2025-05-07T20:31:49.5432368Z T: int, 2025-05-07T20:31:49.5432440Z D: int, 2025-05-07T20:31:49.5432533Z scale_ub: Optional[float], 2025-05-07T20:31:49.5432622Z contiguous: bool, 2025-05-07T20:31:49.5432703Z compiled: bool, 2025-05-07T20:31:49.5432774Z ) -> None: 2025-05-07T20:31:49.5432867Z torch.manual_seed(2025) 2025-05-07T20:31:49.5432938Z 2025-05-07T20:31:49.5433131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5433217Z 2025-05-07T20:31:49.5433307Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5435188Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
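Annotation: the requested sizes in these OutOfMemoryError messages line up exactly with the test's own shapes: every intermediate (x, x_sign, x_clamp, ...) is a [T, 2*D] bfloat16 tensor at 2 bytes per element. A quick check against the figures in this log:

def bf16_mib(T: int, D: int) -> float:
    # Size of one [T, 2*D] bfloat16 tensor in MiB (2 bytes per element).
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB" above
assert bf16_mib(2048, 5120) == 40.0    # the 40.00 MiB examples
assert bf16_mib(4096, 7168) == 112.0   # the 112.00 MiB examples below
assert bf16_mib(16384, 7168) == 448.0  # the 448.00 MiB examples below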
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5435264Z 2025-05-07T20:31:49.5435379Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5435384Z 2025-05-07T20:31:49.5435484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5435705Z self=, 2025-05-07T20:31:49.5435779Z T=16384, 2025-05-07T20:31:49.5435857Z D=5120, 2025-05-07T20:31:49.5435934Z scale_ub=None, 2025-05-07T20:31:49.5436013Z contiguous=True, 2025-05-07T20:31:49.5436095Z compiled=False, 2025-05-07T20:31:49.5436171Z ) 2025-05-07T20:31:49.5436385Z self = 2025-05-07T20:31:49.5436558Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5436562Z 2025-05-07T20:31:49.5436636Z @given( 2025-05-07T20:31:49.5436757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5436851Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5436962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5437075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5437184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5437254Z ) 2025-05-07T20:31:49.5437495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5437584Z def test_silu_mul_quant( 2025-05-07T20:31:49.5437660Z self, 2025-05-07T20:31:49.5437733Z T: int, 2025-05-07T20:31:49.5437806Z D: int, 2025-05-07T20:31:49.5437902Z scale_ub: Optional[float], 2025-05-07T20:31:49.5437986Z contiguous: bool, 2025-05-07T20:31:49.5438066Z compiled: bool, 2025-05-07T20:31:49.5438141Z ) -> None: 2025-05-07T20:31:49.5438230Z torch.manual_seed(2025) 2025-05-07T20:31:49.5438298Z 2025-05-07T20:31:49.5438470Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5440279Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5440285Z 2025-05-07T20:31:49.5440400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5440404Z 2025-05-07T20:31:49.5440501Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5440725Z self=, 2025-05-07T20:31:49.5440803Z T=4096, 2025-05-07T20:31:49.5440873Z D=5120, 2025-05-07T20:31:49.5440951Z scale_ub=None, 2025-05-07T20:31:49.5441030Z contiguous=True, 2025-05-07T20:31:49.5441109Z compiled=False, 2025-05-07T20:31:49.5441182Z ) 2025-05-07T20:31:49.5441397Z self = 2025-05-07T20:31:49.5441561Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:49.5441565Z 2025-05-07T20:31:49.5441639Z @given( 2025-05-07T20:31:49.5441750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5441846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5442036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5442148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5442257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5442327Z ) 2025-05-07T20:31:49.5442566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5442758Z def test_silu_mul_quant( 2025-05-07T20:31:49.5442830Z self, 2025-05-07T20:31:49.5442903Z T: int, 2025-05-07T20:31:49.5442978Z D: int, 2025-05-07T20:31:49.5443072Z scale_ub: Optional[float], 2025-05-07T20:31:49.5443154Z contiguous: bool, 2025-05-07T20:31:49.5443254Z compiled: bool, 2025-05-07T20:31:49.5443337Z ) -> None: 2025-05-07T20:31:49.5443447Z torch.manual_seed(2025) 2025-05-07T20:31:49.5443520Z 2025-05-07T20:31:49.5443682Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5445471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5445482Z 2025-05-07T20:31:49.5445592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5445596Z 2025-05-07T20:31:49.5445695Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5445916Z self=, 2025-05-07T20:31:49.5445990Z T=2048, 2025-05-07T20:31:49.5446061Z D=5120, 2025-05-07T20:31:49.5446136Z scale_ub=None, 2025-05-07T20:31:49.5446222Z contiguous=False, 2025-05-07T20:31:49.5446305Z compiled=False, 2025-05-07T20:31:49.5446375Z ) 2025-05-07T20:31:49.5446594Z self = 2025-05-07T20:31:49.5446760Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5446769Z 2025-05-07T20:31:49.5446842Z @given( 2025-05-07T20:31:49.5446955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5447047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5447156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5447271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5447379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5447446Z ) 2025-05-07T20:31:49.5447687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5447776Z def test_silu_mul_quant( 2025-05-07T20:31:49.5447852Z self, 2025-05-07T20:31:49.5447932Z T: int, 2025-05-07T20:31:49.5448004Z D: int, 2025-05-07T20:31:49.5448101Z scale_ub: Optional[float], 2025-05-07T20:31:49.5448185Z contiguous: bool, 2025-05-07T20:31:49.5448263Z compiled: bool, 2025-05-07T20:31:49.5448336Z ) -> None: 2025-05-07T20:31:49.5448429Z torch.manual_seed(2025) 2025-05-07T20:31:49.5448497Z 2025-05-07T20:31:49.5448660Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5450501Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
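Annotation: one detail of the test worth noting while reading these traces: x0 = x[:, :D] and x1 = x[:, D:] are strided views into the [T, 2*D] buffer, so the contiguous=True examples pay for two extra materialized copies on an already-full device. A small self-contained demonstration:

import torch

x = torch.randn(4, 8)        # stand-in for the [T, 2*D] activation buffer
x0, x1 = x[:, :4], x[:, 4:]
print(x0.is_contiguous())    # False: the view keeps the parent's row stride 8
x0c = x0.contiguous()        # materializes a copy (an extra T*D*2 bytes in bf16)
print(x0c.is_contiguous())   # True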
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True): same OutOfMemoryError at moe/activation_test.py:92 (the initial torch.randn allocation); tried to allocate 112.00 MiB with 30.44 MiB free. [@given block and test body identical to the examples above]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 40.00 MiB with 30.44 MiB free.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5465639Z 2025-05-07T20:31:49.5465751Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5465755Z 2025-05-07T20:31:49.5465855Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5466074Z self=, 2025-05-07T20:31:49.5466230Z T=16384, 2025-05-07T20:31:49.5466304Z D=7168, 2025-05-07T20:31:49.5466379Z scale_ub=None, 2025-05-07T20:31:49.5466457Z contiguous=False, 2025-05-07T20:31:49.5466537Z compiled=True, 2025-05-07T20:31:49.5466605Z ) 2025-05-07T20:31:49.5466822Z self = 2025-05-07T20:31:49.5467139Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:49.5467144Z 2025-05-07T20:31:49.5467217Z @given( 2025-05-07T20:31:49.5467336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5467428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5467534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5467647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5467754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5467825Z ) 2025-05-07T20:31:49.5468071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5468159Z def test_silu_mul_quant( 2025-05-07T20:31:49.5468236Z self, 2025-05-07T20:31:49.5468307Z T: int, 2025-05-07T20:31:49.5468379Z D: int, 2025-05-07T20:31:49.5468471Z scale_ub: Optional[float], 2025-05-07T20:31:49.5468560Z contiguous: bool, 2025-05-07T20:31:49.5468643Z compiled: bool, 2025-05-07T20:31:49.5468716Z ) -> None: 2025-05-07T20:31:49.5468802Z torch.manual_seed(2025) 2025-05-07T20:31:49.5468874Z 2025-05-07T20:31:49.5469033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5475460Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
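Annotation: when triaging a run like this, replaying one specific Hypothesis example is usually faster than re-running the whole search. A sketch using Hypothesis's @example decorator to pin the case that just failed above (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True); the strategies are copied from the log, but the test body here is a placeholder, not the real test:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
@settings(deadline=None, print_blob=True)  # print_blob emits a replay token on failure
def test_silu_mul_quant_replay(T, D, scale_ub, contiguous, compiled) -> None:
    assert T > 0 and D > 0  # placeholder body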
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 112.00 MiB with 30.44 MiB free. [@given block and test body identical to the examples above]

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): same OutOfMemoryError at moe/activation_test.py:92; tried to allocate 448.00 MiB with 30.44 MiB free.

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5490781Z 2025-05-07T20:31:49.5490897Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5490902Z 2025-05-07T20:31:49.5491003Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5491223Z self=, 2025-05-07T20:31:49.5491300Z T=128, 2025-05-07T20:31:49.5491372Z D=5120, 2025-05-07T20:31:49.5491453Z scale_ub=1200.0, 2025-05-07T20:31:49.5491540Z contiguous=False, 2025-05-07T20:31:49.5491617Z compiled=False, 2025-05-07T20:31:49.5491688Z ) 2025-05-07T20:31:49.5491901Z self = 2025-05-07T20:31:49.5492067Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:49.5492072Z 2025-05-07T20:31:49.5492143Z @given( 2025-05-07T20:31:49.5492255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5492347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5492457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5492565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5492679Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5492749Z ) 2025-05-07T20:31:49.5492990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5493080Z def test_silu_mul_quant( 2025-05-07T20:31:49.5493168Z self, 2025-05-07T20:31:49.5493254Z T: int, 2025-05-07T20:31:49.5493337Z D: int, 2025-05-07T20:31:49.5493443Z scale_ub: Optional[float], 2025-05-07T20:31:49.5493525Z contiguous: bool, 2025-05-07T20:31:49.5493608Z compiled: bool, 2025-05-07T20:31:49.5493680Z ) -> None: 2025-05-07T20:31:49.5493774Z torch.manual_seed(2025) 2025-05-07T20:31:49.5493842Z 2025-05-07T20:31:49.5494003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5494076Z 2025-05-07T20:31:49.5494163Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5494284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5494453Z x = x_sign * x_clamp 2025-05-07T20:31:49.5494531Z x0 = x[:, :D] 2025-05-07T20:31:49.5494607Z x1 = x[:, D:] 2025-05-07T20:31:49.5494677Z 2025-05-07T20:31:49.5494756Z if contiguous: 2025-05-07T20:31:49.5494848Z x0 = x0.contiguous() 2025-05-07T20:31:49.5494938Z x1 = x1.contiguous() 2025-05-07T20:31:49.5495081Z 2025-05-07T20:31:49.5495168Z if scale_ub is not None: 2025-05-07T20:31:49.5495269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5495400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5495477Z ) 2025-05-07T20:31:49.5495550Z else: 2025-05-07T20:31:49.5495640Z scale_ub_tensor = None 2025-05-07T20:31:49.5495713Z 2025-05-07T20:31:49.5495837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5495921Z op = silu_mul_quant 2025-05-07T20:31:49.5496004Z if compiled: 2025-05-07T20:31:49.5496105Z op = torch.compile(op) 2025-05-07T20:31:49.5496209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5496279Z 2025-05-07T20:31:49.5496366Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5496370Z 2025-05-07T20:31:49.5496463Z moe/activation_test.py:117: 2025-05-07T20:31:49.5496596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5496692Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5496787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5497285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5497378Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5497736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5497954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5498296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5498387Z kernel = self.compile( 2025-05-07T20:31:49.5498762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5498941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5499065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5499069Z 2025-05-07T20:31:49.5499275Z self = 2025-05-07T20:31:49.5500049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5500560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e7d670>} 2025-05-07T20:31:49.5501302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5501494Z context = 2025-05-07T20:31:49.5501499Z 2025-05-07T20:31:49.5501664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5501923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5502026Z module_map=module_map) 2025-05-07T20:31:49.5502186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5502280Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5502357Z E ^ 2025-05-07T20:31:49.5502791Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5502796Z 2025-05-07T20:31:49.5503206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5503328Z 2025-05-07T20:31:49.5503433Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5503681Z self=, 2025-05-07T20:31:49.5503957Z T=2048, 2025-05-07T20:31:49.5504030Z D=7168, 2025-05-07T20:31:49.5504109Z scale_ub=None, 2025-05-07T20:31:49.5504195Z contiguous=False, 2025-05-07T20:31:49.5504273Z compiled=False, 2025-05-07T20:31:49.5504341Z ) 2025-05-07T20:31:49.5504555Z self = 2025-05-07T20:31:49.5504721Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:49.5504726Z 2025-05-07T20:31:49.5504804Z @given( 2025-05-07T20:31:49.5504923Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5505018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5505133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5505248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5505363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5505435Z ) 2025-05-07T20:31:49.5505674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5505765Z def test_silu_mul_quant( 2025-05-07T20:31:49.5505840Z self, 2025-05-07T20:31:49.5505911Z T: int, 2025-05-07T20:31:49.5505981Z D: int, 2025-05-07T20:31:49.5506079Z scale_ub: Optional[float], 2025-05-07T20:31:49.5506162Z contiguous: bool, 2025-05-07T20:31:49.5506242Z compiled: bool, 2025-05-07T20:31:49.5506317Z ) -> None: 2025-05-07T20:31:49.5506409Z torch.manual_seed(2025) 2025-05-07T20:31:49.5506477Z 2025-05-07T20:31:49.5506642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5508409Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5508422Z 2025-05-07T20:31:49.5508535Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5508540Z 2025-05-07T20:31:49.5508636Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5508858Z self=, 2025-05-07T20:31:49.5508931Z T=128, 2025-05-07T20:31:49.5509002Z D=7168, 2025-05-07T20:31:49.5509083Z scale_ub=1200.0, 2025-05-07T20:31:49.5509162Z contiguous=True, 2025-05-07T20:31:49.5509238Z compiled=True, 2025-05-07T20:31:49.5509309Z ) 2025-05-07T20:31:49.5509520Z self = 2025-05-07T20:31:49.5509690Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5509694Z 2025-05-07T20:31:49.5509766Z @given( 2025-05-07T20:31:49.5509962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5510059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5510167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5510276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5510387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5510459Z ) 2025-05-07T20:31:49.5510832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5510930Z def test_silu_mul_quant( 2025-05-07T20:31:49.5511005Z self, 2025-05-07T20:31:49.5511080Z T: int, 2025-05-07T20:31:49.5511151Z D: int, 2025-05-07T20:31:49.5511245Z scale_ub: Optional[float], 2025-05-07T20:31:49.5511438Z contiguous: bool, 2025-05-07T20:31:49.5511519Z compiled: bool, 2025-05-07T20:31:49.5511591Z ) -> None: 2025-05-07T20:31:49.5511687Z torch.manual_seed(2025) 2025-05-07T20:31:49.5511756Z 2025-05-07T20:31:49.5511920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5511991Z 2025-05-07T20:31:49.5512079Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5512200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5512286Z x = x_sign * x_clamp 2025-05-07T20:31:49.5512363Z x0 = x[:, :D] 2025-05-07T20:31:49.5512445Z x1 = x[:, D:] 2025-05-07T20:31:49.5512519Z 2025-05-07T20:31:49.5512595Z if contiguous: 2025-05-07T20:31:49.5512689Z x0 = x0.contiguous() 2025-05-07T20:31:49.5512774Z x1 = x1.contiguous() 2025-05-07T20:31:49.5512842Z 2025-05-07T20:31:49.5512935Z if scale_ub is not None: 2025-05-07T20:31:49.5513046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.5513178Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.5513256Z ) 2025-05-07T20:31:49.5513330Z else: 2025-05-07T20:31:49.5513418Z scale_ub_tensor = None 2025-05-07T20:31:49.5513491Z 2025-05-07T20:31:49.5513617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.5513701Z op = silu_mul_quant 2025-05-07T20:31:49.5513785Z if compiled: 2025-05-07T20:31:49.5513880Z op = torch.compile(op) 2025-05-07T20:31:49.5513986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5514060Z 2025-05-07T20:31:49.5514146Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.5514150Z 2025-05-07T20:31:49.5514249Z moe/activation_test.py:117: 2025-05-07T20:31:49.5514371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5514466Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.5514566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.5514927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:49.5515014Z return fn(*args, **kwargs) 2025-05-07T20:31:49.5515507Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.5515601Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.5515956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.5516179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.5516512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.5516608Z kernel = self.compile( 2025-05-07T20:31:49.5516980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.5517160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.5517282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.5517286Z 2025-05-07T20:31:49.5517490Z self = 2025-05-07T20:31:49.5518267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.5518854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa5b1e665e0>} 2025-05-07T20:31:49.5519596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.5519858Z context = 2025-05-07T20:31:49.5519863Z 2025-05-07T20:31:49.5520025Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.5520287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.5520389Z module_map=module_map) 2025-05-07T20:31:49.5520553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.5520648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.5520731Z E ^ 2025-05-07T20:31:49.5521084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.5521088Z 2025-05-07T20:31:49.5521498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.5521508Z 2025-05-07T20:31:49.5521606Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5521824Z self=, 2025-05-07T20:31:49.5521897Z T=128, 2025-05-07T20:31:49.5521970Z D=7168, 2025-05-07T20:31:49.5522049Z scale_ub=1200.0, 2025-05-07T20:31:49.5522128Z contiguous=True, 2025-05-07T20:31:49.5522213Z compiled=False, 2025-05-07T20:31:49.5522282Z ) 2025-05-07T20:31:49.5522492Z self = 2025-05-07T20:31:49.5522664Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.5522668Z 2025-05-07T20:31:49.5522740Z @given( 2025-05-07T20:31:49.5522863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5522958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5523094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5523230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5523351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5523420Z ) 2025-05-07T20:31:49.5523664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5523753Z def test_silu_mul_quant( 2025-05-07T20:31:49.5523825Z self, 2025-05-07T20:31:49.5523900Z T: int, 2025-05-07T20:31:49.5523972Z D: int, 2025-05-07T20:31:49.5524068Z scale_ub: Optional[float], 2025-05-07T20:31:49.5524152Z contiguous: bool, 2025-05-07T20:31:49.5524230Z compiled: bool, 2025-05-07T20:31:49.5524309Z ) -> None: 2025-05-07T20:31:49.5524398Z torch.manual_seed(2025) 2025-05-07T20:31:49.5524467Z 2025-05-07T20:31:49.5524634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5524705Z 2025-05-07T20:31:49.5524792Z x_sign = torch.sign(x) 2025-05-07T20:31:49.5524920Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.5526684Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5526690Z 2025-05-07T20:31:49.5526888Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:49.5526893Z 2025-05-07T20:31:49.5526992Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5527211Z self=, 2025-05-07T20:31:49.5527286Z T=128, 2025-05-07T20:31:49.5527430Z D=5120, 2025-05-07T20:31:49.5527513Z scale_ub=1200.0, 2025-05-07T20:31:49.5527591Z contiguous=True, 2025-05-07T20:31:49.5527667Z compiled=True, 2025-05-07T20:31:49.5527739Z ) 2025-05-07T20:31:49.5527956Z self = 2025-05-07T20:31:49.5528118Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:49.5528122Z 2025-05-07T20:31:49.5528198Z @given( 2025-05-07T20:31:49.5528311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5528406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5528527Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5528638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5528748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5528818Z ) 2025-05-07T20:31:49.5529057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5529157Z def test_silu_mul_quant( 2025-05-07T20:31:49.5529231Z self, 2025-05-07T20:31:49.5529304Z T: int, 2025-05-07T20:31:49.5529375Z D: int, 2025-05-07T20:31:49.5529469Z scale_ub: Optional[float], 2025-05-07T20:31:49.5529550Z contiguous: bool, 2025-05-07T20:31:49.5529633Z compiled: bool, 2025-05-07T20:31:49.5529708Z ) -> None: 2025-05-07T20:31:49.5529799Z torch.manual_seed(2025) 2025-05-07T20:31:49.5529871Z 2025-05-07T20:31:49.5530032Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5530103Z 2025-05-07T20:31:49.5530194Z > x_sign = torch.sign(x) 2025-05-07T20:31:49.5531953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5531968Z 2025-05-07T20:31:49.5532080Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:49.5532084Z 2025-05-07T20:31:49.5532182Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.5532402Z self=, 2025-05-07T20:31:49.5532474Z T=128, 2025-05-07T20:31:49.5532546Z D=7168, 2025-05-07T20:31:49.5532631Z scale_ub=None, 2025-05-07T20:31:49.5532709Z contiguous=True, 2025-05-07T20:31:49.5532786Z compiled=True, 2025-05-07T20:31:49.5532857Z ) 2025-05-07T20:31:49.5533072Z self = 2025-05-07T20:31:49.5533251Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.5533262Z 2025-05-07T20:31:49.5533342Z @given( 2025-05-07T20:31:49.5533478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.5533577Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.5533684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.5533797Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.5533909Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.5533979Z ) 2025-05-07T20:31:49.5534217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.5534310Z def test_silu_mul_quant( 2025-05-07T20:31:49.5534485Z self, 2025-05-07T20:31:49.5534565Z T: int, 2025-05-07T20:31:49.5534634Z D: int, 2025-05-07T20:31:49.5534726Z scale_ub: Optional[float], 2025-05-07T20:31:49.5534816Z contiguous: bool, 2025-05-07T20:31:49.5534894Z compiled: bool, 2025-05-07T20:31:49.5535038Z ) -> None: 2025-05-07T20:31:49.5535130Z torch.manual_seed(2025) 2025-05-07T20:31:49.5535197Z 2025-05-07T20:31:49.5535357Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.5537124Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:49.5537130Z 2025-05-07T20:31:49.5537240Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:49.5537373Z =============================== warnings summary =============================== 2025-05-07T20:31:49.5537684Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5537980Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5538266Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:31:49.5539139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:31:49.5539368Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:31:49.5539372Z 2025-05-07T20:31:49.5539547Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:49.5540815Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:49.5541003Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:49.5541007Z 2025-05-07T20:31:49.5541215Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:49.5541375Z ================== 1 failed, 1 passed, 13 warnings in 32.89s =================== 2025-05-07T20:31:51.2304189Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:51.2925677Z 2025-05-07T20:31:51.2926136Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:51.2926500Z 2025-05-07T20:31:51.2926509Z 2025-05-07T20:31:51.2948759Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:53.4611766Z ============================= test session starts ============================== 2025-05-07T20:31:53.4612786Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:53.4613700Z cachedir: .pytest_cache 2025-05-07T20:31:53.4615051Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:53.4616377Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:53.4617031Z plugins: hypothesis-6.131.14 2025-05-07T20:31:55.0572373Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:55.2703391Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:55.2704245Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:55.2704463Z 2025-05-07T20:31:57.4658497Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.4659583Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.4660980Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.4662422Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.4663821Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.4665257Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4666572Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.4667968Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4669399Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.4670720Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.4671942Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.4673157Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.4674199Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.4675242Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.4676497Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.4677782Z W0507 20:31:57.464447 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.4679265Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.4680320Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.4681664Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.4683016Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.4684082Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4685004Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4685748Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.4686762Z W0507 20:31:57.464447 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.4829497Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:57.4830625Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:57.4831962Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:57.4833380Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:57.4834777Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:57.4836153Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4837462Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:57.4838827Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4840244Z W0507 20:31:57.482414 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:57.4841492Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:57.4842706Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.4844083Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:57.4845172Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:57.4846305Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:57.4847519Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.4848788Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.4849903Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:57.4850939Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:57.4852112Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.4853469Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.4854523Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4855466Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4856236Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:57.4857247Z W0507 20:31:57.482414 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1287004Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1287726Z self=, 2025-05-07T20:31:58.1288137Z T=1, 2025-05-07T20:31:58.1288331Z D=5120, 2025-05-07T20:31:58.1288520Z scale_ub=None, 2025-05-07T20:31:58.1288734Z contiguous=True, 2025-05-07T20:31:58.1288961Z compiled=True, 2025-05-07T20:31:58.1289165Z ) 2025-05-07T20:31:58.1289484Z self = 2025-05-07T20:31:58.1290000Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.1290263Z 2025-05-07T20:31:58.1290344Z @given( 2025-05-07T20:31:58.1290569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.1290883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.1291199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.1291527Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.1291858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.1292144Z ) 2025-05-07T20:31:58.1292488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.1292933Z def test_silu_mul_quant( 2025-05-07T20:31:58.1293177Z self, 2025-05-07T20:31:58.1293368Z T: int, 2025-05-07T20:31:58.1293563Z D: int, 2025-05-07T20:31:58.1293776Z scale_ub: Optional[float], 2025-05-07T20:31:58.1294045Z contiguous: bool, 2025-05-07T20:31:58.1294584Z compiled: bool, 2025-05-07T20:31:58.1294812Z ) -> None: 2025-05-07T20:31:58.1295031Z torch.manual_seed(2025) 2025-05-07T20:31:58.1295266Z 2025-05-07T20:31:58.1295560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.1295932Z 2025-05-07T20:31:58.1296267Z x_sign = torch.sign(x) 2025-05-07T20:31:58.1296561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.1296875Z x = x_sign * x_clamp 2025-05-07T20:31:58.1297111Z x0 = x[:, :D] 2025-05-07T20:31:58.1297326Z x1 = x[:, D:] 2025-05-07T20:31:58.1297530Z 2025-05-07T20:31:58.1297708Z if contiguous: 2025-05-07T20:31:58.1297938Z x0 = x0.contiguous() 2025-05-07T20:31:58.1298197Z x1 = x1.contiguous() 2025-05-07T20:31:58.1298430Z 2025-05-07T20:31:58.1298618Z if scale_ub is not None: 2025-05-07T20:31:58.1298890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.1299233Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.1299540Z ) 2025-05-07T20:31:58.1299732Z else: 2025-05-07T20:31:58.1299945Z scale_ub_tensor = None 2025-05-07T20:31:58.1300191Z 2025-05-07T20:31:58.1300423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.1300744Z op = silu_mul_quant 2025-05-07T20:31:58.1300989Z if compiled: 2025-05-07T20:31:58.1301236Z op = torch.compile(op) 2025-05-07T20:31:58.1301537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.1301810Z 2025-05-07T20:31:58.1302004Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.1302291Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.1302575Z 2025-05-07T20:31:58.1302820Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.1303157Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.1303447Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.1304080Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.1304443Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1304752Z 2025-05-07T20:31:58.1304950Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.1305152Z 2025-05-07T20:31:58.1305254Z moe/activation_test.py:126: 2025-05-07T20:31:58.1305553Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1305881Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.1306209Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.1307013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:58.1307783Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.1308331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.1309017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.1309698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.1310466Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1311247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.1311997Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.1312728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.1313371Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.1313979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.1323502Z fn() 2025-05-07T20:31:58.1324066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.1324668Z self.fn.run( 2025-05-07T20:31:58.1325149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.1325844Z kernel = self.compile( 2025-05-07T20:31:58.1326392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.1327053Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.1327457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.1327684Z 2025-05-07T20:31:58.1327891Z self = 2025-05-07T20:31:58.1328999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.1330400Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ece7040>} 2025-05-07T20:31:58.1331758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.1332788Z context = 2025-05-07T20:31:58.1333076Z 2025-05-07T20:31:58.1333248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.1333776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.1334249Z module_map=module_map) 2025-05-07T20:31:58.1334613Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.1334973Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.1335243Z E ^ 2025-05-07T20:31:58.1335711Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.1336172Z 2025-05-07T20:31:58.1336588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.1337106Z 2025-05-07T20:31:58.1337209Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.1337626Z self=, 2025-05-07T20:31:58.1338022Z T=2048, 2025-05-07T20:31:58.1338210Z D=5120, 2025-05-07T20:31:58.1338406Z scale_ub=1200.0, 2025-05-07T20:31:58.1338620Z contiguous=True, 2025-05-07T20:31:58.1338847Z compiled=False, 2025-05-07T20:31:58.1339061Z ) 2025-05-07T20:31:59.1827905Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.1829207Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.1830622Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.1832067Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.1833702Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.1835094Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1836529Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.1837901Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1839317Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.1840569Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.1841785Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.1843003Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.1844046Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.1845065Z W0507 20:31:59.178220 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.1846290Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.1847576Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.1848702Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.1849747Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.1850921Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.1852285Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.1853354Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1854280Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1855041Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.1856108Z W0507 20:31:59.178220 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4149741Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.4150919Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:59.4152253Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.4153810Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.4155192Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.4156582Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4157896Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.4159277Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4160694Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.4161949Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:59.4163164Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4164383Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:59.4165426Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:31:59.4166447Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:59.4167666Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4168950Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4170072Z W0507 
20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:31:59.4171122Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:59.4172299Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4173736Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4174805Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4175796Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4176542Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:59.4177568Z W0507 20:31:59.410914 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2823592Z self = 2025-05-07T20:32:00.2824384Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2824778Z 2025-05-07T20:32:00.2824901Z @given( 2025-05-07T20:32:00.2825212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2825652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2826088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2826539Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2826966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2827253Z ) 2025-05-07T20:32:00.2827606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2828052Z def test_silu_mul_quant( 2025-05-07T20:32:00.2828295Z self, 2025-05-07T20:32:00.2828488Z T: int, 2025-05-07T20:32:00.2828679Z D: int, 2025-05-07T20:32:00.2828899Z scale_ub: Optional[float], 2025-05-07T20:32:00.2829170Z contiguous: bool, 2025-05-07T20:32:00.2829406Z compiled: bool, 2025-05-07T20:32:00.2829635Z ) -> None: 2025-05-07T20:32:00.2829937Z torch.manual_seed(2025) 2025-05-07T20:32:00.2830181Z 2025-05-07T20:32:00.2830459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2830807Z 2025-05-07T20:32:00.2830995Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2831298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2831610Z x = x_sign * x_clamp 2025-05-07T20:32:00.2831854Z x0 = x[:, :D] 2025-05-07T20:32:00.2832067Z x1 = x[:, D:] 2025-05-07T20:32:00.2832275Z 2025-05-07T20:32:00.2832463Z if contiguous: 2025-05-07T20:32:00.2832688Z x0 = x0.contiguous() 2025-05-07T20:32:00.2832952Z x1 = x1.contiguous() 2025-05-07T20:32:00.2833200Z 2025-05-07T20:32:00.2833393Z if scale_ub is not None: 2025-05-07T20:32:00.2833673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2834022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2834329Z ) 2025-05-07T20:32:00.2834524Z else: 2025-05-07T20:32:00.2834734Z scale_ub_tensor = None 
2025-05-07T20:32:00.2834979Z 2025-05-07T20:32:00.2835213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2835535Z op = silu_mul_quant 2025-05-07T20:32:00.2835782Z if compiled: 2025-05-07T20:32:00.2836030Z op = torch.compile(op) 2025-05-07T20:32:00.2836328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2836603Z 2025-05-07T20:32:00.2836787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2836957Z 2025-05-07T20:32:00.2837057Z moe/activation_test.py:117: 2025-05-07T20:32:00.2837352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2837682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2837966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2838848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2839548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2840089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2840896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2841560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2842088Z kernel = self.compile( 2025-05-07T20:32:00.2842635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2843291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2843684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2843918Z 2025-05-07T20:32:00.2844136Z self = 2025-05-07T20:32:00.2845238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2846650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2f1a53a0>} 2025-05-07T20:32:00.2848010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2849038Z context = 2025-05-07T20:32:00.2849336Z 2025-05-07T20:32:00.2849508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2850050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2850545Z module_map=module_map) 2025-05-07T20:32:00.2850909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2851267Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2851528Z E ^ 2025-05-07T20:32:00.2851994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2852463Z 2025-05-07T20:32:00.2852884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2853405Z 2025-05-07T20:32:00.2853508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2853927Z self=, 2025-05-07T20:32:00.2854329Z T=2048, 2025-05-07T20:32:00.2854518Z D=5120, 2025-05-07T20:32:00.2854716Z scale_ub=1200.0, 2025-05-07T20:32:00.2854932Z contiguous=True, 2025-05-07T20:32:00.2855154Z compiled=True, 2025-05-07T20:32:00.2855362Z ) 2025-05-07T20:32:00.2855678Z self = 2025-05-07T20:32:00.2856227Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.2856518Z 2025-05-07T20:32:00.2856593Z @given( 2025-05-07T20:32:00.2856823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2857129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2857437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2857773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2858101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2858390Z ) 2025-05-07T20:32:00.2858737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2859258Z def test_silu_mul_quant( 2025-05-07T20:32:00.2859500Z self, 2025-05-07T20:32:00.2859696Z T: int, 2025-05-07T20:32:00.2859886Z D: int, 2025-05-07T20:32:00.2860103Z scale_ub: Optional[float], 2025-05-07T20:32:00.2860377Z contiguous: bool, 2025-05-07T20:32:00.2860614Z compiled: bool, 2025-05-07T20:32:00.2860909Z ) -> None: 2025-05-07T20:32:00.2861122Z torch.manual_seed(2025) 2025-05-07T20:32:00.2861366Z 2025-05-07T20:32:00.2861655Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2861992Z 2025-05-07T20:32:00.2862181Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2862471Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2862777Z x = x_sign * x_clamp 2025-05-07T20:32:00.2863019Z x0 = x[:, :D] 2025-05-07T20:32:00.2863237Z x1 = x[:, D:] 2025-05-07T20:32:00.2863437Z 2025-05-07T20:32:00.2863620Z if contiguous: 2025-05-07T20:32:00.2863852Z x0 = x0.contiguous() 2025-05-07T20:32:00.2864106Z x1 = x1.contiguous() 2025-05-07T20:32:00.2864348Z 2025-05-07T20:32:00.2864542Z if scale_ub is not None: 2025-05-07T20:32:00.2864814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2865156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2865476Z ) 2025-05-07T20:32:00.2865662Z else: 2025-05-07T20:32:00.2865875Z scale_ub_tensor = None 2025-05-07T20:32:00.2866167Z 2025-05-07T20:32:00.2866415Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2866732Z op = silu_mul_quant 2025-05-07T20:32:00.2866984Z if compiled: 2025-05-07T20:32:00.2867237Z op = torch.compile(op) 2025-05-07T20:32:00.2867534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2867814Z 2025-05-07T20:32:00.2868014Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.2868303Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.2868596Z 2025-05-07T20:32:00.2868834Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2869167Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.2869461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.2869860Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.2870217Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.2870531Z 2025-05-07T20:32:00.2870728Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:00.2870923Z 2025-05-07T20:32:00.2871024Z moe/activation_test.py:126: 2025-05-07T20:32:00.2871315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2871649Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.2871980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.2872783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.2873551Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.2874100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2874791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2875479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.2876209Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.2876965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.2877722Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.2878535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.2879177Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.2879780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.2880363Z fn() 2025-05-07T20:32:00.2880874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.2881457Z self.fn.run( 2025-05-07T20:32:00.2881926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2882454Z kernel = self.compile( 2025-05-07T20:32:00.2883000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2883656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2884053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2884289Z 2025-05-07T20:32:00.2884497Z self = 2025-05-07T20:32:00.2885605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2887062Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2d89f670>} 2025-05-07T20:32:00.2888435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2889482Z context = 2025-05-07T20:32:00.2889775Z 2025-05-07T20:32:00.2889949Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2890472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2890945Z module_map=module_map) 2025-05-07T20:32:00.2891319Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2891679Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.2891942Z E ^ 2025-05-07T20:32:00.2892413Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2892869Z 2025-05-07T20:32:00.2893295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2893812Z 2025-05-07T20:32:00.2893914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2894339Z self=, 2025-05-07T20:32:00.2894749Z T=16384, 2025-05-07T20:32:00.2894948Z D=7168, 2025-05-07T20:32:00.2895143Z scale_ub=1200.0, 2025-05-07T20:32:00.2895369Z contiguous=False, 2025-05-07T20:32:00.2895604Z compiled=False, 2025-05-07T20:32:00.2895804Z ) 2025-05-07T20:32:00.9222665Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.9223858Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:00.9225803Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.9227532Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.9228915Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.9230514Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.9231822Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.9233202Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.9234610Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.9235871Z W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
W0507 20:32:00.918096 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 [0/2] Traceback (most recent call last):
W0507 [0/2]   File ".../torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 [0/2]   File ".../torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 [0/2]   File ".../triton/compiler/compiler.py", line 100, in make_ir
W0507 [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 [0/2]     generator.visit(fn.parse())
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1201, in visit
W0507 [0/2]     ret = super().visit(node)
W0507 [0/2]   File ".../lib/python3.9/ast.py", line 407, in visit
W0507 [0/2]     return visitor(node)
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 352, in visit_Module
W0507 [0/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 [0/2]   File ".../lib/python3.9/ast.py", line 415, in generic_visit
W0507 [0/2]     self.visit(item)
W0507 [0/2]   File ".../triton/compiler/code_generator.py", line 1207, in visit
W0507 [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 [0/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 [0/2] def _fbgemm_silu_mul_quant(
W0507 [0/2] ^
W0507 [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
(The identical warning and traceback are emitted a second time at 20:32:01.095322.)

self = <ActivationTests instance; repr stripped in log capture>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
.../triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
.../triton/runtime/jit.py:623: in run
    kernel = self.compile(
.../triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    (same @given/@settings decorators and test body as in the listing above;
    here fn() returns and the failure is raised in the reference path)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (autotuner and compiler frames as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
W0507 20:32:02.937809 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752 [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(same traceback as the [0/2] warning above, again ending in CompilationError on _fbgemm_silu_mul_quant: "type fp8e4nv not supported in this architecture"; emitted twice, at 20:32:02.937809 and 20:32:03.617552)

self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
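The error fires the moment Triton lowers any kernel that touches an fp8e4nv value, before the kernel ever runs. A minimal repro sketch, independent of FBGEMM (assumptions: Triton and a CUDA build of PyTorch are installed; on a pre-SM-8.9 GPU this should raise the same ValueError wrapped in CompilationError, and on newer GPUs it should simply run):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what trips "type fp8e4nv not supported ..." at compile time.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)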
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (same test body as in the listing above; fn() returns and the failure is
    raised in the reference path)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (autotuner and compiler frames as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
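Per the error text, the only fp8 layouts Triton will lower on this device are fp8e4b15 and fp8e5. If these kernels are meant to run on such GPUs at all, the quantization dtype would have to be chosen per device; a sketch of that selection (illustrative only — whether the FBGEMM fp8 kernels accept e5m2 output is not established by this log):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Assumption: fp8e4nv (e4m3) needs SM >= 8.9; fp8e5 (e5m2) is the
        # variant Triton reports as supported on this older architecture.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2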
Trying example: test_silu_mul_quant(
    self=<ActivationTests instance; repr stripped in log capture>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <ActivationTests instance; repr stripped in log capture>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    (same test body as in the listing above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
.../fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

.../triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4436434Z 2025-05-07T20:32:05.4436852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4437365Z 2025-05-07T20:32:05.4437466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4437880Z self=, 2025-05-07T20:32:05.4438286Z T=1, 2025-05-07T20:32:05.4438465Z D=5120, 2025-05-07T20:32:05.4438656Z scale_ub=None, 2025-05-07T20:32:05.4438870Z contiguous=True, 2025-05-07T20:32:05.4439094Z compiled=True, 2025-05-07T20:32:05.4439291Z ) 2025-05-07T20:32:05.9665384Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.9666578Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:05.9668807Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.9671776Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.9674831Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.9677284Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9678593Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.9680114Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9681530Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.9682787Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:05.9684008Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9685234Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:05.9686276Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:05.9687289Z W0507 20:32:05.962504 87440 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:05.9688511Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9689794Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9690914Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:05.9691945Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:05.9693117Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9694466Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9695521Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9696431Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9697166Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:05.9698177Z W0507 20:32:05.962504 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.1544684Z [the identify_mutated_tensors warning and its CompilationError traceback are logged a second time here, identical to the block above]
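The ValueError pins down the root cause: Triton's fp8e4nv is the e4m3 format (torch.float8_e4m3fn), and Triton only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this runner reports capability 8.6, where Triton offers only fp8e4b15 and fp8e5, exactly as the message says. A minimal probe for gating these tests, assuming only PyTorch is available (gpu_supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite):

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton compiles fp8e4nv (torch.float8_e4m3fn) only for NVIDIA
        # GPUs with compute capability >= (8, 9), e.g. L4/L40 or H100.
        # Pre-Ada parts such as the A10G, which reports (8, 6), raise the
        # ValueError seen above instead.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

Wired into the test class as, say, @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+"), a guard like this would skip the example up front instead of failing inside the Triton compile.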
2025-05-07T20:32:06.6542790Z self = 2025-05-07T20:32:06.6543350Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:06.6543730Z 2025-05-07T20:32:06.6543851Z @given( 2025-05-07T20:32:06.6544170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:06.6544597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:06.6545015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:06.6545446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:06.6545871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:06.6546158Z ) 2025-05-07T20:32:06.6546511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:06.6546953Z def test_silu_mul_quant( 2025-05-07T20:32:06.6547194Z self, 2025-05-07T20:32:06.6547394Z T: int, 2025-05-07T20:32:06.6547597Z D: int, 2025-05-07T20:32:06.6547816Z scale_ub: Optional[float], 2025-05-07T20:32:06.6548093Z contiguous: bool, 2025-05-07T20:32:06.6548334Z compiled: bool, 2025-05-07T20:32:06.6548562Z ) -> None: 2025-05-07T20:32:06.6548788Z torch.manual_seed(2025) 2025-05-07T20:32:06.6549042Z 2025-05-07T20:32:06.6549314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:06.6549658Z 2025-05-07T20:32:06.6549930Z x_sign = torch.sign(x) 2025-05-07T20:32:06.6550220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:06.6550537Z x = x_sign * x_clamp 2025-05-07T20:32:06.6550781Z x0 = x[:, :D] 2025-05-07T20:32:06.6551003Z x1 = x[:, D:] 2025-05-07T20:32:06.6551210Z 2025-05-07T20:32:06.6551401Z if contiguous: 2025-05-07T20:32:06.6551638Z x0 = x0.contiguous() 2025-05-07T20:32:06.6551898Z x1 = x1.contiguous() 2025-05-07T20:32:06.6552144Z 2025-05-07T20:32:06.6552342Z if scale_ub is not None: 2025-05-07T20:32:06.6552620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:06.6552963Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:06.6553277Z ) 2025-05-07T20:32:06.6553474Z else: 2025-05-07T20:32:06.6553690Z scale_ub_tensor = None
2025-05-07T20:32:06.6553947Z 2025-05-07T20:32:06.6554181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.6554502Z op = silu_mul_quant 2025-05-07T20:32:06.6554759Z if compiled: 2025-05-07T20:32:06.6555012Z op = torch.compile(op) 2025-05-07T20:32:06.6555316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:06.6555602Z 2025-05-07T20:32:06.6555799Z y_fp8, y_scale = fn() 2025-05-07T20:32:06.6556085Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:06.6556382Z 2025-05-07T20:32:06.6556626Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:06.6556964Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:06.6557262Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:06.6557583Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:06.6558119Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.6558441Z 2025-05-07T20:32:06.6558647Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:06.6558845Z 2025-05-07T20:32:06.6558947Z moe/activation_test.py:126: 2025-05-07T20:32:06.6559254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.6559713Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:06.6560045Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:06.6560833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:06.6561599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:06.6562148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:06.6562835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:06.6563526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:06.6564253Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.6565006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:06.6565758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:06.6566493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:06.6567136Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:06.6567739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:06.6568254Z fn() 2025-05-07T20:32:06.6568767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:06.6569353Z self.fn.run( 2025-05-07T20:32:06.6575489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:06.6576073Z kernel = self.compile( 2025-05-07T20:32:06.6576624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:06.6577284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.6577699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:06.6577928Z 2025-05-07T20:32:06.6578142Z self = 2025-05-07T20:32:06.6579221Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:06.6580599Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ca9c670>} 2025-05-07T20:32:06.6581950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:06.6582976Z context = 2025-05-07T20:32:06.6583265Z 2025-05-07T20:32:06.6583434Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:06.6583946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.6584418Z module_map=module_map) 2025-05-07T20:32:06.6584784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.6585250Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:06.6585521Z E ^ 2025-05-07T20:32:06.6585996Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.6586445Z 2025-05-07T20:32:06.6586873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:06.6587510Z 2025-05-07T20:32:06.6587613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:06.6588024Z self=, 2025-05-07T20:32:06.6588433Z T=2048, 2025-05-07T20:32:06.6588617Z D=5120, 2025-05-07T20:32:06.6588810Z scale_ub=None, 2025-05-07T20:32:06.6589024Z contiguous=True, 2025-05-07T20:32:06.6589242Z compiled=True, 2025-05-07T20:32:06.6589445Z ) 2025-05-07T20:32:07.1445670Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.1446939Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.1449251Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.1452102Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.1454851Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.1457409Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.1458713Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.1460090Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.1461513Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.1462774Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.1463993Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.1465215Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.1466245Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.1467281Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.1468756Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.1470117Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.1471238Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.1472400Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.1473581Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.1474939Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.1476006Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.1476927Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.1477711Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.1478745Z W0507 20:32:07.140506 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.3329886Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:07.3332022Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:07.3334684Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:07.3337363Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:07.3338798Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:07.3340184Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.3341492Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:07.3342873Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.3344285Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:07.3345536Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:07.3346939Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:07.3348203Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:07.3349365Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:07.3350442Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:32:07.3351664Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:07.3352948Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:07.3354067Z W0507 
20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:07.3355112Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:07.3356285Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:07.3357648Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:07.3358710Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.3359621Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:07.3360360Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:07.3361378Z W0507 20:32:07.328864 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.8368134Z self = 2025-05-07T20:32:07.8368873Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:07.8369237Z 2025-05-07T20:32:07.8369334Z @given( 2025-05-07T20:32:07.8369643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:07.8370015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:07.8370321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:07.8370644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:07.8370978Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:07.8371256Z ) 2025-05-07T20:32:07.8371604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:07.8372049Z def test_silu_mul_quant( 2025-05-07T20:32:07.8372285Z self, 2025-05-07T20:32:07.8372477Z T: int, 2025-05-07T20:32:07.8372670Z D: int, 2025-05-07T20:32:07.8372880Z scale_ub: Optional[float], 2025-05-07T20:32:07.8373147Z contiguous: bool, 2025-05-07T20:32:07.8373380Z compiled: bool, 2025-05-07T20:32:07.8373596Z ) -> None: 2025-05-07T20:32:07.8373810Z torch.manual_seed(2025) 2025-05-07T20:32:07.8374214Z 2025-05-07T20:32:07.8374485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:07.8374826Z 2025-05-07T20:32:07.8375015Z x_sign = torch.sign(x) 2025-05-07T20:32:07.8375304Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:07.8375608Z x = x_sign * x_clamp 2025-05-07T20:32:07.8375993Z x0 = x[:, :D] 2025-05-07T20:32:07.8376205Z x1 = x[:, D:] 2025-05-07T20:32:07.8376409Z 2025-05-07T20:32:07.8376592Z if contiguous: 2025-05-07T20:32:07.8376822Z x0 = x0.contiguous() 2025-05-07T20:32:07.8377073Z x1 = x1.contiguous() 2025-05-07T20:32:07.8377309Z 2025-05-07T20:32:07.8377498Z if scale_ub is not None: 2025-05-07T20:32:07.8377808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:07.8378151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:07.8378456Z ) 2025-05-07T20:32:07.8378644Z else: 2025-05-07T20:32:07.8378860Z scale_ub_tensor = None 
2025-05-07T20:32:07.8379107Z 2025-05-07T20:32:07.8379331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.8379644Z op = silu_mul_quant 2025-05-07T20:32:07.8379888Z if compiled: 2025-05-07T20:32:07.8380129Z op = torch.compile(op) 2025-05-07T20:32:07.8380434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:07.8380709Z 2025-05-07T20:32:07.8380895Z y_fp8, y_scale = fn() 2025-05-07T20:32:07.8381176Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:07.8381463Z 2025-05-07T20:32:07.8381695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:07.8382024Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:07.8382315Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:07.8382627Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:07.8382985Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.8383294Z 2025-05-07T20:32:07.8383491Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:07.8383685Z 2025-05-07T20:32:07.8383787Z moe/activation_test.py:126: 2025-05-07T20:32:07.8384077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.8384414Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:07.8384746Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:07.8385534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:07.8386284Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:07.8386830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:07.8387514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:07.8388200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:07.8388924Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.8389677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:07.8390488Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:07.8391220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:07.8391862Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:07.8392471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:07.8392986Z fn() 2025-05-07T20:32:07.8393580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:07.8394164Z self.fn.run( 2025-05-07T20:32:07.8394630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:07.8395152Z kernel = self.compile( 2025-05-07T20:32:07.8395692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:07.8396422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:07.8396813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:07.8397046Z 2025-05-07T20:32:07.8397254Z self = 2025-05-07T20:32:07.8398395Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:07.8399793Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c8dc9d0>} 2025-05-07T20:32:07.8401147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:07.8402173Z context = 2025-05-07T20:32:07.8402469Z 2025-05-07T20:32:07.8402637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:07.8403153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:07.8403614Z module_map=module_map) 2025-05-07T20:32:07.8404264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:07.8404625Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:07.8404891Z E ^ 2025-05-07T20:32:07.8405342Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:07.8405801Z 2025-05-07T20:32:07.8406217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:07.8406744Z 2025-05-07T20:32:07.8406847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:07.8407254Z self=, 2025-05-07T20:32:07.8407674Z T=128, 2025-05-07T20:32:07.8407879Z D=5120, 2025-05-07T20:32:07.8408064Z scale_ub=None, 2025-05-07T20:32:07.8408270Z contiguous=True, 2025-05-07T20:32:07.8408489Z compiled=True, 2025-05-07T20:32:07.8408686Z ) 2025-05-07T20:32:08.3724012Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.3726105Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.3728149Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.3729567Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.3730936Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.3732474Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.3733779Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.3735245Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.3736652Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.3737941Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.3739143Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.3740341Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.3741380Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.3742389Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.3743595Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.3744871Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.3745982Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.3747023Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.3748239Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.3749578Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.3750687Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.3751592Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.3752327Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.3753334Z W0507 20:32:08.368281 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5656301Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.5657803Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:08.5660499Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.5663533Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.5666272Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.5668439Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5669740Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.5671186Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5672598Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.5673846Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:08.5675074Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.5676289Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:08.5677327Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:08.5678342Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:08.5679568Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.5680850Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.5681966Z W0507 
20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:08.5683011Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:08.5684181Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.5685540Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.5686705Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5687625Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5688440Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:08.5689455Z W0507 20:32:08.561647 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3810484Z self = 2025-05-07T20:32:09.3811048Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.3811430Z 2025-05-07T20:32:09.3811543Z @given( 2025-05-07T20:32:09.3811852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3812257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3812593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3812921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3813245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3813539Z ) 2025-05-07T20:32:09.3813886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3814323Z def test_silu_mul_quant( 2025-05-07T20:32:09.3814570Z self, 2025-05-07T20:32:09.3814765Z T: int, 2025-05-07T20:32:09.3814965Z D: int, 2025-05-07T20:32:09.3815181Z scale_ub: Optional[float], 2025-05-07T20:32:09.3815450Z contiguous: bool, 2025-05-07T20:32:09.3815686Z compiled: bool, 2025-05-07T20:32:09.3815910Z ) -> None: 2025-05-07T20:32:09.3816124Z torch.manual_seed(2025) 2025-05-07T20:32:09.3816364Z 2025-05-07T20:32:09.3816634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3816972Z 2025-05-07T20:32:09.3817164Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3817447Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3817753Z x = x_sign * x_clamp 2025-05-07T20:32:09.3818024Z x0 = x[:, :D] 2025-05-07T20:32:09.3818255Z x1 = x[:, D:] 2025-05-07T20:32:09.3818459Z 2025-05-07T20:32:09.3818643Z if contiguous: 2025-05-07T20:32:09.3818864Z x0 = x0.contiguous() 2025-05-07T20:32:09.3819125Z x1 = x1.contiguous() 2025-05-07T20:32:09.3819360Z 2025-05-07T20:32:09.3819543Z if scale_ub is not None: 2025-05-07T20:32:09.3819817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3820147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3820453Z ) 2025-05-07T20:32:09.3820639Z else: 2025-05-07T20:32:09.3820852Z scale_ub_tensor = None 
2025-05-07T20:32:09.3821096Z 2025-05-07T20:32:09.3821322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3821632Z op = silu_mul_quant 2025-05-07T20:32:09.3821900Z if compiled: 2025-05-07T20:32:09.3822139Z op = torch.compile(op) 2025-05-07T20:32:09.3822438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3822708Z 2025-05-07T20:32:09.3822901Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.3823177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.3823464Z 2025-05-07T20:32:09.3823695Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3824016Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.3824304Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.3824615Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.3824960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3825441Z 2025-05-07T20:32:09.3825648Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.3825841Z 2025-05-07T20:32:09.3825945Z moe/activation_test.py:126: 2025-05-07T20:32:09.3826233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3826563Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.3826999Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3827773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.3828580Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.3829122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3829858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3830544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.3831261Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3832007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:09.3832752Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3833473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.3834108Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.3834700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.3835206Z fn() 2025-05-07T20:32:09.3835704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.3836281Z self.fn.run( 2025-05-07T20:32:09.3836738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3837255Z kernel = self.compile( 2025-05-07T20:32:09.3837789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3838437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3838823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3839054Z 2025-05-07T20:32:09.3839260Z self = 2025-05-07T20:32:09.3840342Z options = CUDAOptions(num_warps=4, 
num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3841726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c87c940>} 2025-05-07T20:32:09.3843069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3844091Z context = 2025-05-07T20:32:09.3844384Z 2025-05-07T20:32:09.3844545Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3845067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3845528Z module_map=module_map) 2025-05-07T20:32:09.3845885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3846237Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.3846577Z E ^ 2025-05-07T20:32:09.3847034Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3847484Z 2025-05-07T20:32:09.3847921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3848529Z 2025-05-07T20:32:09.3848629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3849037Z self=, 2025-05-07T20:32:09.3849428Z T=4096, 2025-05-07T20:32:09.3849609Z D=5120, 2025-05-07T20:32:09.3849797Z scale_ub=None, 2025-05-07T20:32:09.3850002Z contiguous=True, 2025-05-07T20:32:09.3850215Z compiled=True, 2025-05-07T20:32:09.3850413Z ) 2025-05-07T20:32:09.9161405Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.9162475Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:09.9163808Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.9165226Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.9166598Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.9167983Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9169279Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.9170640Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9172048Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.9173288Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:09.9174496Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.9175698Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:09.9176730Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 407, in visit 2025-05-07T20:32:09.9177822Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:09.9179527Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.9180931Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.9182035Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.9183181Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:09.9184347Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.9185694Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.9186744Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9187688Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9188616Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:09.9189863Z W0507 20:32:09.912056 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.1095317Z W0507 20:32:10.105468 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:10.1097403Z W0507 20:32:10.105468 87440 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] [traceback identical to the warning above, again ending in triton.compiler.errors.CompilationError at 1:0: def _fbgemm_silu_mul_quant( with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:10.7836622Z self = <...>
2025-05-07T20:32:10.7837168Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7feb2c5e9700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:10.7874849Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:10.8319579Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:10.8320823Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:10.8322149Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:10.8323132Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:10.8324228Z W0507 20:32:10.830472 87440 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
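##[group]Editor's note: root cause of the repeated CompilationError
Every failing example above dies at the same point: Triton refuses to lower the kernel to TTIR because the fp8e4nv dtype (float8 e4m3) is not available on this GPU. fp8e4nv requires an NVIDIA part with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G, which is SM 8.6, where Triton only exposes fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that a test like this could use to skip cleanly on unsupported hardware (the helper name and class name are assumptions for illustration, not FBGEMM's actual code):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Hypothetical helper: fp8e4nv (float8_e4m3fn) needs compute
        # capability >= (8, 9), i.e. Ada or Hopper; the A10G in this
        # job reports (8, 6), so Triton raises the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

The recompile_limit warning just above is a side effect of the same loop: each hypothesis example changes the layout of x0/x1 (the stride at index 0 flips between 5120 and 10240 as `contiguous` toggles), so torch.compile installs a new guard per variant until it hits config.recompile_limit (8) and stops recompiling that frame. If the recompiles themselves were the concern, one hedged option (the attribute name is taken from the warning text; 64 is an arbitrary example value) would be:

    import torch._dynamo

    # Lift the per-frame recompile cap that the warning mentions (default 8).
    torch._dynamo.config.recompile_limit = 64

or running with TORCH_LOGS="recompiles" to see every guard failure, as the warning itself suggests.
##[endgroup]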
2025-05-07T20:32:10.9563850Z self = <...>
2025-05-07T20:32:10.9564381Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source and traceback identical to the T = 4096 failure above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

[hypothesis kept generating examples; each reprinted the same test source and failed with the same CompilationError. Condensed:]
2025-05-07T20:32:10.9602825Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.1378607Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same error
2025-05-07T20:32:11.2255583Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.5991316Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.6023030Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.7604822Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() -> _fbgemm_silu_mul_quant: same error
2025-05-07T20:32:11.7635534Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() -> _fbgemm_silu_mul_quant, failing with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Both examples print the identical test body and traceback as the T=128 example above, except that with compiled=True the call first passes through torch/_dynamo/eval_frame.py:678 (_fn) before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant). Each fails compiling _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
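For orientation: silu_mul_quant fuses y = x0 * sigmoid(x0) * x1 with row-wise FP8 quantization, which is why the generated Triton code needs the fp8e4nv (torch.float8_e4m3fn) type at all. Below is a minimal PyTorch sketch of the row-wise quantization step, assuming E4M3's finite max of 448 and treating scale_ub as an upper bound on the per-row max; this is an illustrative reference, not FBGEMM's triton_quantize_fp8_row implementation.

import torch
from typing import Optional, Tuple

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max magnitude; optionally clamp it so one outlier row
    # cannot blow up that row's shared scale.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # E4M3 (fp8e4nv) has a finite max of 448; keep a tiny floor to avoid
    # dividing by zero on all-zero rows.
    scale = torch.clamp(row_max, min=1e-12) / 448.0
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] matches how the test reconstructs y from fn()'s outputs.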
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[Same test body as above. Here the call to fn() succeeds, and the failure moves to the reference path instead:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
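At this point both the kernel under test (_fbgemm_silu_mul_quant) and the reference kernel (_kernel_quantize_fp8_row) have failed on the same Triton check: fp8e4nv (torch.float8_e4m3fn) is only supported on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G on this linux.g5.4xlarge runner is SM 8.6. A minimal sketch of a capability gate for such tests follows, assuming one wants to skip rather than fail on older GPUs; the helper name supports_fp8e4nv is hypothetical, not part of the FBGEMM test suite.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv maps to torch.float8_e4m3fn and is only lowered on
    # SM 8.9+ (Ada/Hopper); the A10G in this log is SM 8.6, hence the error.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative usage on the failing test:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...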
2025-05-07T20:32:12.6669653Z op = torch.compile(op) 2025-05-07T20:32:12.6670028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6670300Z 2025-05-07T20:32:12.6670493Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.6670661Z 2025-05-07T20:32:12.6670762Z moe/activation_test.py:117: 2025-05-07T20:32:12.6671061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6671384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.6671665Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6672227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.6672780Z return fn(*args, **kwargs) 2025-05-07T20:32:12.6673438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.6674126Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.6674661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.6675336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.6675997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.6676533Z kernel = self.compile( 2025-05-07T20:32:12.6677064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.6677710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.6678101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6678445Z 2025-05-07T20:32:12.6678654Z self = 2025-05-07T20:32:12.6679804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.6681176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b23bf70>} 2025-05-07T20:32:12.6682554Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.6683578Z context = 2025-05-07T20:32:12.6683868Z 2025-05-07T20:32:12.6684033Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.6684546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.6685013Z module_map=module_map) 2025-05-07T20:32:12.6685380Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.6685728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.6685995Z E ^ 2025-05-07T20:32:12.6686460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.6686910Z 2025-05-07T20:32:12.6687326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.6687832Z 2025-05-07T20:32:12.6687933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.6688340Z self=, 2025-05-07T20:32:12.6688769Z T=1, 2025-05-07T20:32:12.6688972Z D=5120, 2025-05-07T20:32:12.6689165Z scale_ub=1200.0, 2025-05-07T20:32:12.6689390Z contiguous=False, 2025-05-07T20:32:12.6689617Z compiled=False, 2025-05-07T20:32:12.6689818Z ) 2025-05-07T20:32:12.6690141Z self = 2025-05-07T20:32:12.6690626Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:12.6690894Z 2025-05-07T20:32:12.6690970Z @given( 2025-05-07T20:32:12.6691202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.6691512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.6691825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.6692149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.6692479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.6692767Z ) 2025-05-07T20:32:12.6693116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.6693551Z def test_silu_mul_quant( 2025-05-07T20:32:12.6693793Z self, 2025-05-07T20:32:12.6693980Z T: int, 2025-05-07T20:32:12.6694177Z D: int, 2025-05-07T20:32:12.6694403Z scale_ub: Optional[float], 2025-05-07T20:32:12.6694670Z contiguous: bool, 2025-05-07T20:32:12.6694911Z compiled: bool, 2025-05-07T20:32:12.6695130Z ) -> None: 2025-05-07T20:32:12.6695342Z torch.manual_seed(2025) 2025-05-07T20:32:12.6695583Z 2025-05-07T20:32:12.6695862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.6696205Z 2025-05-07T20:32:12.6696392Z x_sign = torch.sign(x) 2025-05-07T20:32:12.6696682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.6696991Z x = x_sign * x_clamp 2025-05-07T20:32:12.6697229Z x0 = x[:, :D] 2025-05-07T20:32:12.6697448Z x1 = x[:, D:] 2025-05-07T20:32:12.6697724Z 2025-05-07T20:32:12.6697906Z if contiguous: 2025-05-07T20:32:12.6698137Z x0 = x0.contiguous() 2025-05-07T20:32:12.6698396Z x1 = x1.contiguous() 2025-05-07T20:32:12.6698628Z 2025-05-07T20:32:12.6698894Z if scale_ub is not None: 2025-05-07T20:32:12.6699189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.6699546Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.6699850Z ) 2025-05-07T20:32:12.6700041Z else: 2025-05-07T20:32:12.6700246Z scale_ub_tensor = None 2025-05-07T20:32:12.6700538Z 2025-05-07T20:32:12.6700766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.6701077Z op = silu_mul_quant 2025-05-07T20:32:12.6701330Z if compiled: 2025-05-07T20:32:12.6701576Z op = torch.compile(op) 2025-05-07T20:32:12.6701874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6702147Z 2025-05-07T20:32:12.6702342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.6702506Z 2025-05-07T20:32:12.6702611Z moe/activation_test.py:117: 2025-05-07T20:32:12.6702896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6703227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.6703512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.6704495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.6705186Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.6705726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.6706397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.6707042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.6707569Z kernel = self.compile( 2025-05-07T20:32:12.6708099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.6708738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.6709130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.6709358Z 2025-05-07T20:32:12.6709563Z self = 2025-05-07T20:32:12.6710696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.6712068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ac483a0>} 2025-05-07T20:32:12.6713403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.6714423Z context = 2025-05-07T20:32:12.6714710Z 2025-05-07T20:32:12.6714872Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.6715387Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.6715850Z module_map=module_map) 2025-05-07T20:32:12.6716209Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.6716561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.6716815Z E ^ 2025-05-07T20:32:12.6717280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.6717833Z 2025-05-07T20:32:12.6718244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.6718803Z 2025-05-07T20:32:12.6718906Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.6719425Z self=, 2025-05-07T20:32:12.6719831Z T=16384, 2025-05-07T20:32:12.6720024Z D=5120, 2025-05-07T20:32:12.6720205Z scale_ub=1200.0, 2025-05-07T20:32:12.6720426Z contiguous=False, 2025-05-07T20:32:12.6720652Z compiled=True, 2025-05-07T20:32:12.6720911Z ) 2025-05-07T20:32:12.7903360Z self = 2025-05-07T20:32:12.7904014Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7904298Z 2025-05-07T20:32:12.7904412Z @given( 2025-05-07T20:32:12.7904688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7905126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7905435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7905760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7906084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7906375Z ) 2025-05-07T20:32:12.7906723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7907161Z def test_silu_mul_quant( 2025-05-07T20:32:12.7907397Z self, 2025-05-07T20:32:12.7907587Z T: int, 2025-05-07T20:32:12.7907776Z D: int, 2025-05-07T20:32:12.7907990Z scale_ub: Optional[float], 2025-05-07T20:32:12.7908255Z contiguous: bool, 2025-05-07T20:32:12.7908483Z compiled: bool, 2025-05-07T20:32:12.7908697Z ) -> None: 2025-05-07T20:32:12.7908909Z torch.manual_seed(2025) 2025-05-07T20:32:12.7909145Z 2025-05-07T20:32:12.7909403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7909738Z 2025-05-07T20:32:12.7909997Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7910275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7910576Z x = x_sign * x_clamp 2025-05-07T20:32:12.7910821Z x0 = x[:, :D] 2025-05-07T20:32:12.7911027Z x1 = x[:, D:] 2025-05-07T20:32:12.7911228Z 2025-05-07T20:32:12.7911403Z if contiguous: 2025-05-07T20:32:12.7911620Z x0 = x0.contiguous() 2025-05-07T20:32:12.7911880Z x1 = x1.contiguous() 2025-05-07T20:32:12.7912118Z 2025-05-07T20:32:12.7912311Z if scale_ub is not None: 2025-05-07T20:32:12.7912573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7912906Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7913211Z ) 2025-05-07T20:32:12.7913390Z else: 2025-05-07T20:32:12.7913597Z scale_ub_tensor = None 2025-05-07T20:32:12.7913841Z 2025-05-07T20:32:12.7914067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7914377Z op = silu_mul_quant 2025-05-07T20:32:12.7914626Z if compiled: 2025-05-07T20:32:12.7914865Z op = torch.compile(op) 2025-05-07T20:32:12.7915160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7915432Z 2025-05-07T20:32:12.7915616Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7915783Z 2025-05-07T20:32:12.7915880Z moe/activation_test.py:117: 2025-05-07T20:32:12.7916172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7916501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7916770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7917322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7917877Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7918538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.7919388Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.7919915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.7920736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.7921388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.7921911Z kernel = self.compile( 2025-05-07T20:32:12.7922501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.7923141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.7923528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7923755Z 2025-05-07T20:32:12.7923960Z self = 2025-05-07T20:32:12.7925045Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.7926417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2abfa0d0>} 2025-05-07T20:32:12.7927749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.7928771Z context = 2025-05-07T20:32:12.7929111Z 2025-05-07T20:32:12.7929273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.7929788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.7930241Z module_map=module_map) 2025-05-07T20:32:12.7930607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.7930954Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.7931200Z E ^ 2025-05-07T20:32:12.7931661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.7932108Z 2025-05-07T20:32:12.7932521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.7933033Z 2025-05-07T20:32:12.7933136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.7933536Z self=, 2025-05-07T20:32:12.7933940Z T=2048, 2025-05-07T20:32:12.7934118Z D=7168, 2025-05-07T20:32:12.7934304Z scale_ub=1200.0, 2025-05-07T20:32:12.7934521Z contiguous=False, 2025-05-07T20:32:12.7934737Z compiled=True, 2025-05-07T20:32:12.7934928Z ) 2025-05-07T20:32:12.7935243Z self = 2025-05-07T20:32:12.7935748Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:12.7936015Z 2025-05-07T20:32:12.7936091Z @given( 2025-05-07T20:32:12.7936309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.7936614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.7936912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.7937235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.7937565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.7937844Z ) 2025-05-07T20:32:12.7938182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.7938614Z def test_silu_mul_quant( 2025-05-07T20:32:12.7938911Z self, 2025-05-07T20:32:12.7939090Z T: int, 2025-05-07T20:32:12.7939280Z D: int, 2025-05-07T20:32:12.7939493Z scale_ub: Optional[float], 2025-05-07T20:32:12.7939765Z contiguous: bool, 2025-05-07T20:32:12.7940066Z compiled: bool, 2025-05-07T20:32:12.7940282Z ) -> None: 2025-05-07T20:32:12.7940489Z torch.manual_seed(2025) 2025-05-07T20:32:12.7940729Z 2025-05-07T20:32:12.7940997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.7941355Z 2025-05-07T20:32:12.7941546Z x_sign = torch.sign(x) 2025-05-07T20:32:12.7941863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.7942166Z x = x_sign * x_clamp 2025-05-07T20:32:12.7942394Z x0 = x[:, :D] 2025-05-07T20:32:12.7942602Z x1 = x[:, D:] 2025-05-07T20:32:12.7942803Z 2025-05-07T20:32:12.7942976Z if contiguous: 2025-05-07T20:32:12.7943199Z x0 = x0.contiguous() 2025-05-07T20:32:12.7943456Z x1 = x1.contiguous() 2025-05-07T20:32:12.7943694Z 2025-05-07T20:32:12.7943880Z if scale_ub is not None: 2025-05-07T20:32:12.7944152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.7944478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.7944777Z ) 2025-05-07T20:32:12.7944969Z else: 2025-05-07T20:32:12.7945170Z scale_ub_tensor = None 2025-05-07T20:32:12.7945412Z 2025-05-07T20:32:12.7945635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.7945946Z op = silu_mul_quant 2025-05-07T20:32:12.7946182Z if compiled: 2025-05-07T20:32:12.7946423Z op = torch.compile(op) 2025-05-07T20:32:12.7946720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7946986Z 2025-05-07T20:32:12.7947181Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.7947340Z 2025-05-07T20:32:12.7947449Z moe/activation_test.py:117: 2025-05-07T20:32:12.7947740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7948068Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.7948350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.7948945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:12.7949488Z return fn(*args, **kwargs) 
2025-05-07T20:32:12.7950195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.7950877Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.7951398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.7952069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.7952719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.7953243Z kernel = self.compile( 2025-05-07T20:32:12.7953769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.7954415Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.7954804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.7955027Z 2025-05-07T20:32:12.7955234Z self = 2025-05-07T20:32:12.7956312Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.7957684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2abfaca0>} 2025-05-07T20:32:12.7959201Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.7960222Z context = 2025-05-07T20:32:12.7960504Z 2025-05-07T20:32:12.7960667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.7961181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.7961682Z module_map=module_map) 2025-05-07T20:32:12.7962041Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.7962383Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.7962639Z E ^ 2025-05-07T20:32:12.7963099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.7963551Z 2025-05-07T20:32:12.7963961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.7964471Z 2025-05-07T20:32:13.0657036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0657463Z self=, 2025-05-07T20:32:13.0657892Z T=1, 2025-05-07T20:32:13.0658085Z D=5120, 2025-05-07T20:32:13.0658279Z scale_ub=None, 2025-05-07T20:32:13.0658491Z contiguous=False, 2025-05-07T20:32:13.0658729Z compiled=False, 2025-05-07T20:32:13.0658939Z ) 2025-05-07T20:32:13.0659301Z self = 2025-05-07T20:32:13.0659785Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.0660046Z 2025-05-07T20:32:13.0660133Z @given( 2025-05-07T20:32:13.0660359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0660671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0660970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0661294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0661615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0661894Z ) 2025-05-07T20:32:13.0662232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0662662Z def test_silu_mul_quant( 2025-05-07T20:32:13.0662896Z self, 2025-05-07T20:32:13.0663086Z T: int, 2025-05-07T20:32:13.0663274Z D: int, 2025-05-07T20:32:13.0663490Z scale_ub: Optional[float], 2025-05-07T20:32:13.0663755Z contiguous: bool, 2025-05-07T20:32:13.0663983Z compiled: bool, 2025-05-07T20:32:13.0664200Z ) -> None: 2025-05-07T20:32:13.0664416Z torch.manual_seed(2025) 2025-05-07T20:32:13.0664648Z 2025-05-07T20:32:13.0664914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0665249Z 2025-05-07T20:32:13.0665438Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0665723Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0666031Z x = x_sign * x_clamp 2025-05-07T20:32:13.0666266Z x0 = x[:, :D] 2025-05-07T20:32:13.0666475Z x1 = x[:, D:] 2025-05-07T20:32:13.0666679Z 2025-05-07T20:32:13.0666862Z if contiguous: 2025-05-07T20:32:13.0667090Z x0 = x0.contiguous() 2025-05-07T20:32:13.0667342Z x1 = x1.contiguous() 2025-05-07T20:32:13.0667576Z 2025-05-07T20:32:13.0667760Z if scale_ub is not None: 2025-05-07T20:32:13.0668026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0668360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0668660Z ) 2025-05-07T20:32:13.0668854Z else: 2025-05-07T20:32:13.0669070Z scale_ub_tensor = None 2025-05-07T20:32:13.0669428Z 2025-05-07T20:32:13.0669655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0670035Z op = silu_mul_quant 2025-05-07T20:32:13.0670279Z if compiled: 2025-05-07T20:32:13.0670682Z op = torch.compile(op) 2025-05-07T20:32:13.0670985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0671252Z 2025-05-07T20:32:13.0671433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0671598Z 2025-05-07T20:32:13.0671695Z moe/activation_test.py:117: 2025-05-07T20:32:13.0671984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0672364Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0672642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0673326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0674004Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.0674532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.0675215Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.0675870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.0676394Z kernel = self.compile( 2025-05-07T20:32:13.0676926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.0677577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.0684060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0684304Z 2025-05-07T20:32:13.0684523Z self = 2025-05-07T20:32:13.0685623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.0687016Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b178670>} 2025-05-07T20:32:13.0688358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.0689381Z context = 2025-05-07T20:32:13.0689666Z 2025-05-07T20:32:13.0689835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.0690349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.0690811Z module_map=module_map) 2025-05-07T20:32:13.0691171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.0691516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.0691760Z E ^ 2025-05-07T20:32:13.0692229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.0692681Z 2025-05-07T20:32:13.0693098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.0693609Z 2025-05-07T20:32:13.0693716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.0694141Z self=, 2025-05-07T20:32:13.0694543Z T=4096, 2025-05-07T20:32:13.0694727Z D=7168, 2025-05-07T20:32:13.0694914Z scale_ub=1200.0, 2025-05-07T20:32:13.0695129Z contiguous=False, 2025-05-07T20:32:13.0695346Z compiled=False, 2025-05-07T20:32:13.0695629Z ) 2025-05-07T20:32:13.0695940Z self = 2025-05-07T20:32:13.0696430Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:13.0696703Z 2025-05-07T20:32:13.0696782Z @given( 2025-05-07T20:32:13.0697078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.0697385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.0697686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.0698001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.0698373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.0698654Z ) 2025-05-07T20:32:13.0698995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.0699423Z def test_silu_mul_quant( 2025-05-07T20:32:13.0699654Z self, 2025-05-07T20:32:13.0699840Z T: int, 2025-05-07T20:32:13.0700023Z D: int, 2025-05-07T20:32:13.0700237Z scale_ub: Optional[float], 2025-05-07T20:32:13.0700502Z contiguous: bool, 2025-05-07T20:32:13.0700729Z compiled: bool, 2025-05-07T20:32:13.0700941Z ) -> None: 2025-05-07T20:32:13.0701151Z torch.manual_seed(2025) 2025-05-07T20:32:13.0701386Z 2025-05-07T20:32:13.0701648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.0701980Z 2025-05-07T20:32:13.0702161Z x_sign = torch.sign(x) 2025-05-07T20:32:13.0702440Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.0702739Z x = x_sign * x_clamp 2025-05-07T20:32:13.0702977Z x0 = x[:, :D] 2025-05-07T20:32:13.0703187Z x1 = x[:, D:] 2025-05-07T20:32:13.0703385Z 2025-05-07T20:32:13.0703568Z if contiguous: 2025-05-07T20:32:13.0704055Z x0 = x0.contiguous() 2025-05-07T20:32:13.0704308Z x1 = x1.contiguous() 2025-05-07T20:32:13.0704537Z 2025-05-07T20:32:13.0704718Z if scale_ub is not None: 2025-05-07T20:32:13.0704979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.0705308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.0705604Z ) 2025-05-07T20:32:13.0705792Z else: 2025-05-07T20:32:13.0706002Z scale_ub_tensor = None 2025-05-07T20:32:13.0706246Z 2025-05-07T20:32:13.0706469Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.0706775Z op = silu_mul_quant 2025-05-07T20:32:13.0707018Z if compiled: 2025-05-07T20:32:13.0707253Z op = torch.compile(op) 2025-05-07T20:32:13.0707547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0707809Z 2025-05-07T20:32:13.0707988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.0708153Z 2025-05-07T20:32:13.0708249Z moe/activation_test.py:117: 2025-05-07T20:32:13.0708537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.0708884Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.0709189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.0709923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.0710615Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
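This job ran on a linux.g5.4xlarge runner, i.e. an NVIDIA A10G at compute capability (SM) 8.6. Triton's NVIDIA backend only compiles the fp8e4nv (e4m3) dtype on SM 8.9 and newer (Ada/Hopper), which is why every drawn example dies in make_ir before the kernel ever runs. Below is a minimal sketch of a capability guard that would skip these cases on older GPUs; supports_fp8e4nv and the decorated class are illustrative assumptions, not FBGEMM code:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs an SM 8.9+ GPU; the A10G here reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test class like the failing one:
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTests(unittest.TestCase):
    ...

Skipping at the class level keeps Hypothesis from drawing any examples at all, so the log would show a single skip instead of one CompilationError per example.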
Hypothesis then tried ten more parameter combinations, each failing with the same CompilationError at the same call site (moe/activation_test.py:117 via activation.py:80):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
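Every "Trying example:" record above is Hypothesis echoing one drawn parameter combination: the test is parametrized with @given over sampled_from strategies, and settings(verbosity=Verbosity.verbose) makes the engine print each example before running it. A self-contained toy (no GPU needed; test_toy and its strategy values are illustrative) that reproduces this logging pattern:

from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
def test_toy(T: int, D: int) -> None:
    # The assertion is trivial; the point is the "Trying example: ..." output.
    assert T * D > 0


if __name__ == "__main__":
    test_toy()  # Hypothesis prints one "Trying example" line per draw

Running this prints five "Trying example: test_toy(T=..., D=...)" lines, mirroring the records in this log.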
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.5837108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:14.5837772Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.8130896Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:14.8163221Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:14.9405116Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.1689811Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError
2025-05-07T20:32:15.1722503Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.4965487Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError
2025-05-07T20:32:15.7773207Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError
2025-05-07T20:32:15.7804438Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError
2025-05-07T20:32:15.7835541Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
2025-05-07T20:32:15.9073255Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.9103937Z 2025-05-07T20:32:15.9104468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.9104982Z 2025-05-07T20:32:16.0854215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.0856599Z self=, 2025-05-07T20:32:16.0857858Z T=2048, 2025-05-07T20:32:16.0858264Z D=7168, 2025-05-07T20:32:16.0858636Z scale_ub=None, 2025-05-07T20:32:16.0859071Z contiguous=True, 2025-05-07T20:32:16.0859486Z compiled=True, 2025-05-07T20:32:16.0859702Z ) 2025-05-07T20:32:16.0860132Z self = 2025-05-07T20:32:16.0860642Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.0860917Z 2025-05-07T20:32:16.0861004Z @given( 2025-05-07T20:32:16.0861238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.0861564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.0861892Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.0862227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.0862563Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.0862858Z ) 2025-05-07T20:32:16.0863227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.0863674Z def test_silu_mul_quant( 2025-05-07T20:32:16.0863925Z self, 2025-05-07T20:32:16.0864131Z T: int, 2025-05-07T20:32:16.0864327Z D: int, 2025-05-07T20:32:16.0864558Z scale_ub: Optional[float], 2025-05-07T20:32:16.0864844Z contiguous: bool, 2025-05-07T20:32:16.0865085Z compiled: bool, 2025-05-07T20:32:16.0865321Z ) -> None: 2025-05-07T20:32:16.0865544Z torch.manual_seed(2025) 2025-05-07T20:32:16.0865787Z 2025-05-07T20:32:16.0866071Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.0866424Z 2025-05-07T20:32:16.0866619Z x_sign = torch.sign(x) 2025-05-07T20:32:16.0866922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.0867247Z x = x_sign * x_clamp 2025-05-07T20:32:16.0867491Z x0 = x[:, :D] 2025-05-07T20:32:16.0867724Z x1 = x[:, D:] 2025-05-07T20:32:16.0867948Z 2025-05-07T20:32:16.0868135Z if contiguous: 2025-05-07T20:32:16.0868382Z x0 = x0.contiguous() 2025-05-07T20:32:16.0868656Z x1 = x1.contiguous() 2025-05-07T20:32:16.0868916Z 2025-05-07T20:32:16.0869110Z if scale_ub is not None: 2025-05-07T20:32:16.0869401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.0869753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.0870197Z ) 2025-05-07T20:32:16.0870403Z else: 2025-05-07T20:32:16.0870625Z scale_ub_tensor = None 2025-05-07T20:32:16.0870880Z 2025-05-07T20:32:16.0871124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.0871451Z op = silu_mul_quant 2025-05-07T20:32:16.0871703Z if compiled: 2025-05-07T20:32:16.0871977Z op = torch.compile(op) 2025-05-07T20:32:16.0872288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0872570Z 2025-05-07T20:32:16.0872772Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.0872940Z 2025-05-07T20:32:16.0873054Z moe/activation_test.py:117: 2025-05-07T20:32:16.0873361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0873695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.0873988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.0874557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.0875117Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.0875785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.0876580Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.0877125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.0877891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.0878565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.0879113Z kernel = self.compile( 2025-05-07T20:32:16.0879658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.0880420Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.0880831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.0881063Z 2025-05-07T20:32:16.0881281Z self = 2025-05-07T20:32:16.0882378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.0883917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2a5314c0>} 2025-05-07T20:32:16.0885415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.0886464Z context = 2025-05-07T20:32:16.0886753Z 2025-05-07T20:32:16.0886934Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.0887457Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.0887943Z module_map=module_map) 2025-05-07T20:32:16.0888322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.0888687Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.0888957Z E ^ 2025-05-07T20:32:16.0889436Z E ValueError("type fp8e4nv not supported in this architecture. 
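Every CompilationError in this run has the same root cause: the Triton kernel requests the fp8e4nv (torch.float8_e4m3fn) element type, which Triton lowers only on GPUs of compute capability 8.9 or newer (Ada/Hopper class). This job's linux.g5.4xlarge runner carries an A10G, which reports sm_86, so only the fp8e4b15 and fp8e5 encodings are available there. A minimal capability guard along these lines (a sketch, not code from FBGEMM or this workflow) would let such tests skip cleanly on unsupported hardware:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) requires sm_89 or newer;
    # the A10G on this runner reports (8, 6), hence the ValueError above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on a test such as test_silu_mul_quant:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")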
2025-05-07T20:32:16.0890962Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:16.0900355Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.0902405Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.0904929Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.0905259Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.0914200Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.0916215Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.0918316Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.0918641Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:16.1984053Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.1986161Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.1988225Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.1988562Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.1997917Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:16.2000051Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.2002085Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:16.2002412Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.2011465Z >       x_sign = torch.sign(x)
2025-05-07T20:32:16.2013445Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.2015467Z moe/activation_test.py:94: OutOfMemoryError
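The requested allocation sizes match the test's bfloat16 input exactly: x = torch.randn([T, 2 * D]) occupies T * 2D * 2 bytes, and each of torch.sign, torch.abs, and torch.clamp materializes one more buffer of the same shape, which is why the failures at activation_test.py:92, :94, and :95 all request the full tensor size. A quick sanity check of the numbers (illustrative arithmetic only, not part of the test suite):

# Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
def tensor_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(tensor_mib(16384, 5120))  # 320.0 -> matches "Tried to allocate 320.00 MiB"
print(tensor_mib(16384, 7168))  # 448.0 -> matches the 448.00 MiB requests
print(tensor_mib(2048, 7168))   # 56.0  -> matches the 56.00 MiB requests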
2025-05-07T20:32:16.2015879Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.3639933Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.3641976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.3642613Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.3671046Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.3673626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:16.3674278Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:16.4607452Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.4609402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
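Because the test runs with verbosity=Verbosity.verbose, every generated example is printed, so each failing parameter combination above can be replayed deterministically. One option during local debugging is Hypothesis's @example decorator, which forces a case to run in addition to the generated ones; a sketch with the decorator values taken from the log above (test body elided, name hypothetical):

from hypothesis import Verbosity, example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(T, D, scale_ub, contiguous, compiled):
    ...  # same body as test_silu_mul_quant in moe/activation_test.py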
2025-05-07T20:32:16.4610024Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.4618290Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:16.4620415Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.4622395Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.4622713Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:16.5145990Z E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:16.5147935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
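For orientation on what the failing call computes: the test passes two [T, D] bfloat16 halves plus an optional scale upper bound and unpacks (y_fp8, y_scale). Judging purely from the names and signature in the test, not from FBGEMM source, silu_mul_quant fuses y = silu(x0) * x1 with rowwise FP8 (e4m3) quantization; a rough eager stand-in might look like:

import torch
import torch.nn.functional as F

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_reference(x0, x1, scale_ub=None):
    # Hypothetical eager equivalent of the fused Triton kernel under test.
    y = F.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)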
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.5147518Z 2025-05-07T20:32:16.5147935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.5148449Z 2025-05-07T20:32:16.5148549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5148963Z self=, 2025-05-07T20:32:16.5149416Z T=2048, 2025-05-07T20:32:16.5149596Z D=5120, 2025-05-07T20:32:16.5149785Z scale_ub=None, 2025-05-07T20:32:16.5150050Z contiguous=True, 2025-05-07T20:32:16.5150270Z compiled=False, 2025-05-07T20:32:16.5150476Z ) 2025-05-07T20:32:16.5150912Z self = 2025-05-07T20:32:16.5151401Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.5151686Z 2025-05-07T20:32:16.5151760Z @given( 2025-05-07T20:32:16.5151991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5152334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5152637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5152967Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5153571Z ) 2025-05-07T20:32:16.5153916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5154362Z def test_silu_mul_quant( 2025-05-07T20:32:16.5154595Z self, 2025-05-07T20:32:16.5154786Z T: int, 2025-05-07T20:32:16.5154980Z D: int, 2025-05-07T20:32:16.5155193Z scale_ub: Optional[float], 2025-05-07T20:32:16.5155458Z contiguous: bool, 2025-05-07T20:32:16.5155692Z compiled: bool, 2025-05-07T20:32:16.5155901Z ) -> None: 2025-05-07T20:32:16.5156115Z torch.manual_seed(2025) 2025-05-07T20:32:16.5156351Z 2025-05-07T20:32:16.5156613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5156951Z 2025-05-07T20:32:16.5157138Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.5159088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5160957Z 2025-05-07T20:32:16.5161072Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.5161279Z 2025-05-07T20:32:16.5161393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5161799Z self=, 2025-05-07T20:32:16.5162202Z T=16384, 2025-05-07T20:32:16.5162389Z D=5120, 2025-05-07T20:32:16.5162570Z scale_ub=None, 2025-05-07T20:32:16.5162793Z contiguous=True, 2025-05-07T20:32:16.5163016Z compiled=False, 2025-05-07T20:32:16.5163215Z ) 2025-05-07T20:32:16.5163575Z self = 2025-05-07T20:32:16.5164173Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.5164476Z 2025-05-07T20:32:16.5164619Z @given( 2025-05-07T20:32:16.5165033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.5171678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.5172009Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.5172348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.5172676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.5172974Z ) 2025-05-07T20:32:16.5173335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.5173782Z def test_silu_mul_quant( 2025-05-07T20:32:16.5174029Z self, 2025-05-07T20:32:16.5174227Z T: int, 2025-05-07T20:32:16.5174416Z D: int, 2025-05-07T20:32:16.5174642Z scale_ub: Optional[float], 2025-05-07T20:32:16.5174921Z contiguous: bool, 2025-05-07T20:32:16.5175244Z compiled: bool, 2025-05-07T20:32:16.5175464Z ) -> None: 2025-05-07T20:32:16.5175690Z torch.manual_seed(2025) 2025-05-07T20:32:16.5175937Z 2025-05-07T20:32:16.5176206Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.5178398Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.5180356Z 2025-05-07T20:32:16.5180480Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.5180699Z 2025-05-07T20:32:16.5180807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.5181224Z self=, 2025-05-07T20:32:16.5181633Z T=4096, 2025-05-07T20:32:16.5181818Z D=5120, 2025-05-07T20:32:16.5182015Z scale_ub=None, 2025-05-07T20:32:16.5182221Z contiguous=True, 2025-05-07T20:32:16.5182449Z compiled=False, 2025-05-07T20:32:16.5182656Z ) 2025-05-07T20:32:16.6236725Z self = 2025-05-07T20:32:16.6237798Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6238347Z 2025-05-07T20:32:16.6238491Z @given( 2025-05-07T20:32:16.6238942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6239431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6239730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6240062Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6240397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6240677Z ) 2025-05-07T20:32:16.6241025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6241469Z def test_silu_mul_quant( 2025-05-07T20:32:16.6241717Z self, 2025-05-07T20:32:16.6241913Z T: int, 2025-05-07T20:32:16.6242116Z D: int, 2025-05-07T20:32:16.6242335Z scale_ub: Optional[float], 2025-05-07T20:32:16.6242599Z contiguous: bool, 2025-05-07T20:32:16.6242837Z compiled: bool, 2025-05-07T20:32:16.6243063Z ) -> None: 2025-05-07T20:32:16.6243273Z torch.manual_seed(2025) 2025-05-07T20:32:16.6243518Z 2025-05-07T20:32:16.6243794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6245877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6247778Z 2025-05-07T20:32:16.6247895Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6248119Z 2025-05-07T20:32:16.6248220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6248642Z self=, 2025-05-07T20:32:16.6249046Z T=2048, 2025-05-07T20:32:16.6249232Z D=5120, 2025-05-07T20:32:16.6249413Z scale_ub=None, 2025-05-07T20:32:16.6249631Z contiguous=False, 2025-05-07T20:32:16.6249860Z compiled=False, 2025-05-07T20:32:16.6250058Z ) 2025-05-07T20:32:16.6250523Z self = 2025-05-07T20:32:16.6251096Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.6251416Z 2025-05-07T20:32:16.6251496Z @given( 2025-05-07T20:32:16.6251857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6252211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6252552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6252920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6253293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6253673Z ) 2025-05-07T20:32:16.6254066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6254586Z def test_silu_mul_quant( 2025-05-07T20:32:16.6254852Z self, 2025-05-07T20:32:16.6255052Z T: int, 2025-05-07T20:32:16.6255266Z D: int, 2025-05-07T20:32:16.6255501Z scale_ub: Optional[float], 2025-05-07T20:32:16.6255794Z contiguous: bool, 2025-05-07T20:32:16.6256049Z compiled: bool, 2025-05-07T20:32:16.6256286Z ) -> None: 2025-05-07T20:32:16.6256508Z torch.manual_seed(2025) 2025-05-07T20:32:16.6256769Z 2025-05-07T20:32:16.6257067Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6259675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.6262069Z 2025-05-07T20:32:16.6262202Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.6262444Z 2025-05-07T20:32:16.6262552Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6263026Z self=, 2025-05-07T20:32:16.6263492Z T=4096, 2025-05-07T20:32:16.6263685Z D=7168, 2025-05-07T20:32:16.6263885Z scale_ub=None, 2025-05-07T20:32:16.6264117Z contiguous=True, 2025-05-07T20:32:16.6264348Z compiled=True, 2025-05-07T20:32:16.6264564Z ) 2025-05-07T20:32:16.6264920Z self = 2025-05-07T20:32:16.6265489Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.6265799Z 2025-05-07T20:32:16.6265876Z @given( 2025-05-07T20:32:16.6266120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6266467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6266800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6267173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6267541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6267859Z ) 2025-05-07T20:32:16.6268268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6268784Z def test_silu_mul_quant( 2025-05-07T20:32:16.6269049Z self, 2025-05-07T20:32:16.6269252Z T: int, 2025-05-07T20:32:16.6269461Z D: int, 2025-05-07T20:32:16.6269697Z scale_ub: Optional[float], 2025-05-07T20:32:16.6270037Z contiguous: bool, 2025-05-07T20:32:16.6270276Z compiled: bool, 2025-05-07T20:32:16.6270495Z ) -> None: 2025-05-07T20:32:16.6270703Z torch.manual_seed(2025) 2025-05-07T20:32:16.6270942Z 2025-05-07T20:32:16.6271215Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6273403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.6275470Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:16.6275817Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:32:16.6288056Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:16.6300668Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:16.9549371Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:32:16.9561719Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:16.9573941Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:16.9585632Z moe/activation_test.py:92: OutOfMemoryError
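Every failure above is the same allocator pattern: roughly 22 GiB is already held on the 22.07 GiB device before the example starts, so even small requests (20 to 448 MiB) fail at the first allocation. The OOM text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; the following is a minimal sketch of applying it, not part of the log, assuming the tests are launched from a Python entry point:

```python
# Sketch only (not from the log): the allocator hint suggested by the OOM
# message above. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching
# allocator initializes, so it must be set before the first CUDA allocation.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the variable so the allocator sees it

# The largest shape Hypothesis tried above: 16384 x (2 * 7168) in bfloat16.
x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

# Note: expandable segments only mitigates fragmentation; it cannot reclaim
# the ~21.7 GiB the log shows still allocated from earlier examples. Freeing
# tensors and calling torch.cuda.empty_cache() between examples addresses that.
```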
2025-05-07T20:32:16.9585938Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.1271149Z > y_fp8, y_scale = fn()
2025-05-07T20:32:17.1271413Z moe/activation_test.py:117:
2025-05-07T20:32:17.1271704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.1272026Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.1272301Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.1272984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.1273673Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.1274205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:17.1274880Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.1275526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.1276172Z     kernel = self.compile(
2025-05-07T20:32:17.1276705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.1283689Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.1289896Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.1290413Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:32:17.1291244Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.1291599Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.1291860Z E   ^
2025-05-07T20:32:17.1292310Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.1293180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.1293796Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:17.1301593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.1303966Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:17.1306004Z moe/activation_test.py:92: OutOfMemoryError
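The CompilationError above is architectural rather than flaky: Triton's fp8e4nv (e4m3) type requires compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge runner is sm_86 and only exposes fp8e4b15 and fp8e5, exactly as the ValueError states. Below is a hedged sketch of the kind of capability guard that would skip these cases on unsupported GPUs; the helper name and its placement are illustrative, not the repository's actual code:

```python
# Sketch only: skip fp8 test cases on GPUs without fp8e4nv support.
import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) lowering needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this runner reports sm_86.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant as in the listing above
```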
2025-05-07T20:32:17.1306317Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant via fn() at moe/activation_test.py:117, entering through torch/_dynamo/eval_frame.py:678 (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:17.1792105Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp allocation; tried to allocate 20.00 MiB)
2025-05-07T20:32:17.1805257Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 20.00 MiB)
2025-05-07T20:32:17.1818217Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4308002Z 2025-05-07T20:32:17.4308128Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.4308336Z 2025-05-07T20:32:17.4339375Z FAILED 2025-05-07T20:32:17.4339619Z 2025-05-07T20:32:17.4339805Z =================================== FAILURES =================================== 2025-05-07T20:32:17.4340256Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:17.4340791Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:17.4341655Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:17.4342398Z | yield 2025-05-07T20:32:17.4342970Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:17.4343673Z | self._callTestMethod(testMethod) 2025-05-07T20:32:17.4344431Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:17.4345327Z | method() 2025-05-07T20:32:17.4346340Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:17.4347373Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4348244Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:17.4349086Z | raise the_error_hypothesis_found 2025-05-07T20:32:17.4350189Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:17.4350846Z +-+---------------- 1 ---------------- 2025-05-07T20:32:17.4351232Z | Traceback (most recent call last): 2025-05-07T20:32:17.4352195Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4353253Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4356085Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4358799Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4359388Z | self=, 2025-05-07T20:32:17.4359948Z | T=2048, 2025-05-07T20:32:17.4360271Z | D=5120, # or any other generated value 2025-05-07T20:32:17.4360725Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.4361200Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.4361699Z | compiled=False, # or any other generated value 2025-05-07T20:32:17.4362119Z | ) 2025-05-07T20:32:17.4362346Z | 2025-05-07T20:32:17.4363047Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:17.4363872Z +---------------- 2 ---------------- 2025-05-07T20:32:17.4364259Z | Traceback (most recent call last): 2025-05-07T20:32:17.4365219Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4366291Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4369158Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4371891Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4372491Z | self=, 2025-05-07T20:32:17.4373029Z | T=128, 2025-05-07T20:32:17.4373301Z | D=7168, 2025-05-07T20:32:17.4373522Z | scale_ub=None, 2025-05-07T20:32:17.4373794Z | contiguous=True, 2025-05-07T20:32:17.4374141Z | compiled=True, 2025-05-07T20:32:17.4374526Z | ) 2025-05-07T20:32:17.4374753Z | 2025-05-07T20:32:17.4375404Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4376090Z +---------------- 3 ---------------- 2025-05-07T20:32:17.4376373Z | Traceback (most recent call last): 2025-05-07T20:32:17.4377068Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:17.4377835Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4379969Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.4381959Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4382395Z | self=, 2025-05-07T20:32:17.4382802Z | T=128, 2025-05-07T20:32:17.4382994Z | D=5120, 2025-05-07T20:32:17.4383199Z | scale_ub=1200.0, 2025-05-07T20:32:17.4383432Z | contiguous=True, 2025-05-07T20:32:17.4383668Z | compiled=True, 2025-05-07T20:32:17.4383888Z | ) 2025-05-07T20:32:17.4384052Z | 2025-05-07T20:32:17.4384573Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4385176Z +---------------- 4 ---------------- 2025-05-07T20:32:17.4385457Z | Traceback (most recent call last): 2025-05-07T20:32:17.4386309Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:17.4387365Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4388297Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:17.4389276Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4390622Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:17.4391809Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4392656Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:17.4393656Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4394678Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:17.4395752Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4396858Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:17.4397956Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4399044Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:17.4400005Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4400893Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:17.4401810Z | fn() 2025-05-07T20:32:17.4402698Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:17.4403570Z | self.fn.run( 2025-05-07T20:32:17.4404689Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:17.4405514Z | kernel = self.compile( 2025-05-07T20:32:17.4406567Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:17.4428640Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4429673Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:17.4430903Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4431626Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4432112Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4432472Z | ^ 2025-05-07T20:32:17.4433135Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4433925Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:17.4434490Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:17.4435196Z | self=, 2025-05-07T20:32:17.4435784Z | T=1, # or any other generated value 2025-05-07T20:32:17.4436212Z | D=5120, # or any other generated value 2025-05-07T20:32:17.4436655Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:17.4437145Z | contiguous=True, # or any other generated value 2025-05-07T20:32:17.4437629Z | compiled=True, # or any other generated value 2025-05-07T20:32:17.4438019Z | ) 2025-05-07T20:32:17.4438260Z | 2025-05-07T20:32:17.4438970Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:17.4439803Z +------------------------------------ 2025-05-07T20:32:17.4440295Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:17.4440817Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4441395Z self=, 2025-05-07T20:32:17.4441939Z T=1, 2025-05-07T20:32:17.4442196Z D=5120, 2025-05-07T20:32:17.4442464Z scale_ub=None, 2025-05-07T20:32:17.4442761Z contiguous=True, 2025-05-07T20:32:17.4443074Z compiled=True, 2025-05-07T20:32:17.4443362Z ) 2025-05-07T20:32:17.4443798Z self = 2025-05-07T20:32:17.4444468Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4444840Z 2025-05-07T20:32:17.4444935Z @given( 2025-05-07T20:32:17.4445241Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4445661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4446091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4446563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4447003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4447399Z ) 2025-05-07T20:32:17.4447895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4448502Z def test_silu_mul_quant( 2025-05-07T20:32:17.4448828Z self, 2025-05-07T20:32:17.4449087Z T: int, 2025-05-07T20:32:17.4449351Z D: int, 2025-05-07T20:32:17.4449861Z scale_ub: Optional[float], 2025-05-07T20:32:17.4450247Z contiguous: bool, 2025-05-07T20:32:17.4450567Z compiled: bool, 2025-05-07T20:32:17.4450869Z ) -> None: 2025-05-07T20:32:17.4451158Z torch.manual_seed(2025) 2025-05-07T20:32:17.4451485Z 2025-05-07T20:32:17.4451982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4452454Z 2025-05-07T20:32:17.4452728Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4453097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4453516Z x = x_sign * x_clamp 2025-05-07T20:32:17.4453914Z x0 = x[:, :D] 2025-05-07T20:32:17.4454192Z x1 = x[:, D:] 2025-05-07T20:32:17.4454481Z 2025-05-07T20:32:17.4454737Z if contiguous: 2025-05-07T20:32:17.4455048Z x0 = x0.contiguous() 
2025-05-07T20:32:17.4455397Z x1 = x1.contiguous() 2025-05-07T20:32:17.4455719Z 2025-05-07T20:32:17.4455966Z if scale_ub is not None: 2025-05-07T20:32:17.4456332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4456780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4457189Z ) 2025-05-07T20:32:17.4457438Z else: 2025-05-07T20:32:17.4457721Z scale_ub_tensor = None 2025-05-07T20:32:17.4458061Z 2025-05-07T20:32:17.4458375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4458810Z op = silu_mul_quant 2025-05-07T20:32:17.4459162Z if compiled: 2025-05-07T20:32:17.4459500Z op = torch.compile(op) 2025-05-07T20:32:17.4459924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4460306Z 2025-05-07T20:32:17.4460568Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4460956Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4461362Z 2025-05-07T20:32:17.4461692Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4462160Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4462565Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4462993Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4463498Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4463926Z 2025-05-07T20:32:17.4464198Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4464469Z 2025-05-07T20:32:17.4464604Z moe/activation_test.py:126: 2025-05-07T20:32:17.4465017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4465483Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4465941Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4467053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4468112Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4468874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4469791Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4470803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4471742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4472731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4473724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4474695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4475549Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4476417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4477130Z fn() 2025-05-07T20:32:17.4477938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4478754Z self.fn.run( 2025-05-07T20:32:17.4479391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4480129Z kernel = self.compile( 2025-05-07T20:32:17.4480877Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4481788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4482297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4482602Z 2025-05-07T20:32:17.4482867Z self = 2025-05-07T20:32:17.4484294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4486146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ece7040>} 2025-05-07T20:32:17.4487915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4489279Z context = 2025-05-07T20:32:17.4489669Z 2025-05-07T20:32:17.4489886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4490572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4491203Z module_map=module_map) 2025-05-07T20:32:17.4491694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4492160Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4492525Z E ^ 2025-05-07T20:32:17.4493150Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4493773Z 2025-05-07T20:32:17.4494339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4495017Z 2025-05-07T20:32:17.4495162Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4495722Z self=, 2025-05-07T20:32:17.4496248Z T=2048, 2025-05-07T20:32:17.4496493Z D=5120, 2025-05-07T20:32:17.4496740Z scale_ub=1200.0, 2025-05-07T20:32:17.4497026Z contiguous=True, 2025-05-07T20:32:17.4497326Z compiled=False, 2025-05-07T20:32:17.4497604Z ) 2025-05-07T20:32:17.4498031Z self = 2025-05-07T20:32:17.4498720Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.4499098Z 2025-05-07T20:32:17.4499210Z @given( 2025-05-07T20:32:17.4499512Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4499932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4500342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4500782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4501212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4501590Z ) 2025-05-07T20:32:17.4502056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4502643Z def test_silu_mul_quant( 2025-05-07T20:32:17.4502965Z self, 2025-05-07T20:32:17.4503279Z T: int, 2025-05-07T20:32:17.4503527Z D: int, 2025-05-07T20:32:17.4504107Z scale_ub: Optional[float], 2025-05-07T20:32:17.4504479Z contiguous: bool, 2025-05-07T20:32:17.4504802Z compiled: bool, 2025-05-07T20:32:17.4505268Z ) -> None: 2025-05-07T20:32:17.4505555Z torch.manual_seed(2025) 2025-05-07T20:32:17.4505865Z 2025-05-07T20:32:17.4506236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4506710Z 2025-05-07T20:32:17.4506962Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4507443Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4507860Z x = x_sign * x_clamp 2025-05-07T20:32:17.4508182Z x0 = x[:, :D] 
2025-05-07T20:32:17.4508465Z x1 = x[:, D:] 2025-05-07T20:32:17.4508745Z 2025-05-07T20:32:17.4508991Z if contiguous: 2025-05-07T20:32:17.4509300Z x0 = x0.contiguous() 2025-05-07T20:32:17.4509652Z x1 = x1.contiguous() 2025-05-07T20:32:17.4510080Z 2025-05-07T20:32:17.4510331Z if scale_ub is not None: 2025-05-07T20:32:17.4510695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4511155Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4511560Z ) 2025-05-07T20:32:17.4511818Z else: 2025-05-07T20:32:17.4512100Z scale_ub_tensor = None 2025-05-07T20:32:17.4512445Z 2025-05-07T20:32:17.4512755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4513180Z op = silu_mul_quant 2025-05-07T20:32:17.4513518Z if compiled: 2025-05-07T20:32:17.4513846Z op = torch.compile(op) 2025-05-07T20:32:17.4514250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4514619Z 2025-05-07T20:32:17.4514876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4515107Z 2025-05-07T20:32:17.4515233Z moe/activation_test.py:117: 2025-05-07T20:32:17.4515642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4516089Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4516481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4517431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4518364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4519093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4519977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4520825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4521516Z kernel = self.compile( 2025-05-07T20:32:17.4522271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4523171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4523705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4524010Z 2025-05-07T20:32:17.4524287Z self = 2025-05-07T20:32:17.4525759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4527624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2f1a53a0>} 2025-05-07T20:32:17.4529470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4530977Z context = 2025-05-07T20:32:17.4531269Z 2025-05-07T20:32:17.4531444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4532074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4532547Z module_map=module_map) 2025-05-07T20:32:17.4532914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4533262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4533559Z E ^ 2025-05-07T20:32:17.4534024Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4534476Z 2025-05-07T20:32:17.4534901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4535416Z 2025-05-07T20:32:17.4535517Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4535938Z self=, 2025-05-07T20:32:17.4536348Z T=2048, 2025-05-07T20:32:17.4536529Z D=5120, 2025-05-07T20:32:17.4536729Z scale_ub=1200.0, 2025-05-07T20:32:17.4536953Z contiguous=True, 2025-05-07T20:32:17.4537164Z compiled=True, 2025-05-07T20:32:17.4537364Z ) 2025-05-07T20:32:17.4537686Z self = 2025-05-07T20:32:17.4538179Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.4538449Z 2025-05-07T20:32:17.4538527Z @given( 2025-05-07T20:32:17.4538754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4539063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4539362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4539690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4540020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4540298Z ) 2025-05-07T20:32:17.4540644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4541086Z def test_silu_mul_quant( 2025-05-07T20:32:17.4541324Z self, 2025-05-07T20:32:17.4541516Z T: int, 2025-05-07T20:32:17.4541714Z D: int, 2025-05-07T20:32:17.4541935Z scale_ub: Optional[float], 2025-05-07T20:32:17.4542201Z contiguous: bool, 2025-05-07T20:32:17.4542437Z compiled: bool, 2025-05-07T20:32:17.4542663Z ) -> None: 2025-05-07T20:32:17.4542870Z torch.manual_seed(2025) 2025-05-07T20:32:17.4543112Z 2025-05-07T20:32:17.4543384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4543722Z 2025-05-07T20:32:17.4543907Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4544196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4544502Z x = x_sign * x_clamp 2025-05-07T20:32:17.4544740Z x0 = x[:, :D] 2025-05-07T20:32:17.4544954Z x1 = x[:, D:] 2025-05-07T20:32:17.4545154Z 2025-05-07T20:32:17.4545335Z if contiguous: 2025-05-07T20:32:17.4545569Z x0 = x0.contiguous() 2025-05-07T20:32:17.4545822Z x1 = x1.contiguous() 2025-05-07T20:32:17.4546064Z 2025-05-07T20:32:17.4546258Z if scale_ub is not None: 2025-05-07T20:32:17.4546524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4546861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4547174Z ) 2025-05-07T20:32:17.4547366Z else: 2025-05-07T20:32:17.4547573Z scale_ub_tensor = None 2025-05-07T20:32:17.4547825Z 2025-05-07T20:32:17.4548054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4548363Z op = silu_mul_quant 2025-05-07T20:32:17.4548611Z if compiled: 2025-05-07T20:32:17.4548911Z op = torch.compile(op) 2025-05-07T20:32:17.4549201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4549479Z 2025-05-07T20:32:17.4549671Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4550110Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4550403Z 2025-05-07T20:32:17.4550634Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4550962Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4551259Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4551576Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4551975Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4552278Z 2025-05-07T20:32:17.4552481Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4552674Z 2025-05-07T20:32:17.4552777Z moe/activation_test.py:126: 2025-05-07T20:32:17.4553067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4553405Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4553735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4554532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4555301Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4555847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4556531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4557218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4557939Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4558696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4559446Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4560173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4560813Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4561416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4561931Z fn() 2025-05-07T20:32:17.4562432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4563012Z self.fn.run( 2025-05-07T20:32:17.4563482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4564005Z kernel = self.compile( 2025-05-07T20:32:17.4564550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4565206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4565608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4565839Z 2025-05-07T20:32:17.4566045Z self = 2025-05-07T20:32:17.4567142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4568536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2d89f670>} 2025-05-07T20:32:17.4569897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4570978Z context = 2025-05-07T20:32:17.4571277Z 2025-05-07T20:32:17.4571519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4572054Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4572527Z module_map=module_map) 2025-05-07T20:32:17.4572891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4573296Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4573568Z E ^ 2025-05-07T20:32:17.4574035Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4574496Z 2025-05-07T20:32:17.4574912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4575431Z 
2025-05-07T20:32:17.4575533Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4575949Z self=, 2025-05-07T20:32:17.4576354Z T=16384, 2025-05-07T20:32:17.4576543Z D=7168, 2025-05-07T20:32:17.4576733Z scale_ub=1200.0, 2025-05-07T20:32:17.4576948Z contiguous=False, 2025-05-07T20:32:17.4577176Z compiled=False, 2025-05-07T20:32:17.4577377Z ) 2025-05-07T20:32:17.4577688Z self = 2025-05-07T20:32:17.4578188Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4590097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4590362Z moe/activation_test.py:117: 2025-05-07T20:32:17.4603943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4604386Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4604642Z E ^ 2025-05-07T20:32:17.4605113Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4614226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4614891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4615306Z self=, 2025-05-07T20:32:17.4615722Z T=1, 2025-05-07T20:32:17.4615911Z D=7168, 2025-05-07T20:32:17.4616097Z scale_ub=None, 2025-05-07T20:32:17.4616316Z contiguous=True, 2025-05-07T20:32:17.4616542Z compiled=True, 2025-05-07T20:32:17.4616740Z ) 2025-05-07T20:32:17.4617055Z self = 2025-05-07T20:32:17.4617660Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4632027Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4632328Z moe/activation_test.py:126: 2025-05-07T20:32:17.4652307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4652660Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4652923Z E ^ 2025-05-07T20:32:17.4653397Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4654257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
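Every example above fails with the same underlying error: this Triton build only compiles the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer, while older parts such as the A10G report (8, 6) and only get the fp8e4b15/fp8e5 encodings. A minimal guard sketch along those lines; the helper name is illustrative and not part of the test file:

    # Hypothetical capability guard -- not part of moe/activation_test.py.
    import torch

    def has_fp8e4nv_support() -> bool:
        """True if the GPU can compile Triton kernels using fp8e4nv.

        Triton's fp8e4nv maps to torch.float8_e4m3fn and, in the Triton
        release shown in this log, requires compute capability >= (8, 9)
        (Ada/Hopper); an A10G reports (8, 6), hence the ValueError above.
        """
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A test like test_silu_mul_quant could then be decorated with unittest.skipUnless(has_fp8e4nv_support(), "needs SM 8.9+") so that pre-SM89 runners skip instead of failing on every Hypothesis example.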
2025-05-07T20:32:17.4654870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4655282Z self=, 2025-05-07T20:32:17.4655686Z T=4096, 2025-05-07T20:32:17.4655862Z D=5120, 2025-05-07T20:32:17.4656048Z scale_ub=None, 2025-05-07T20:32:17.4656258Z contiguous=False, 2025-05-07T20:32:17.4656476Z compiled=False, 2025-05-07T20:32:17.4656674Z ) 2025-05-07T20:32:17.4657037Z self = 2025-05-07T20:32:17.4657521Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4668947Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4669209Z moe/activation_test.py:117: 2025-05-07T20:32:17.4682722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4683068Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4683321Z E ^ 2025-05-07T20:32:17.4683785Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4684660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4685269Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4685675Z self=, 2025-05-07T20:32:17.4686078Z T=4096, 2025-05-07T20:32:17.4686250Z D=7168, 2025-05-07T20:32:17.4686439Z scale_ub=None, 2025-05-07T20:32:17.4686648Z contiguous=False, 2025-05-07T20:32:17.4686862Z compiled=False, 2025-05-07T20:32:17.4687063Z ) 2025-05-07T20:32:17.4687380Z self = 2025-05-07T20:32:17.4687866Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4699296Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4699559Z moe/activation_test.py:117: 2025-05-07T20:32:17.4713263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4713615Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4713868Z E ^ 2025-05-07T20:32:17.4714329Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4715293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
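The ref_fn failures above die inside triton_quantize_fp8_row while Triton's autotuner benchmarks candidate configs. For reference, a pure-PyTorch sketch of the row-wise quantization the test expects, consistent with its dequantization step y = y_fp8.to(torch.float32) * y_scale[:, None]; the exact scale_ub clamping semantics in fbgemm are an assumption here:

    # Pure-PyTorch sketch of row-wise fp8 quantization; scale_ub is
    # treated as a cap on the per-row max magnitude (an assumption).
    import torch

    def quantize_fp8_row_ref(y, scale_ub=None, eps=1e-12):
        row_max = y.abs().amax(dim=1).float()  # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = torch.clamp(row_max, min=eps) / fp8_max  # dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Unlike the Triton kernel, this eager-mode cast runs even on pre-SM89 GPUs, since PyTorch emulates float8 conversion rather than emitting fp8e4nv instructions.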
2025-05-07T20:32:17.4715907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4716424Z self=, 2025-05-07T20:32:17.4716829Z T=128, 2025-05-07T20:32:17.4717012Z D=7168, 2025-05-07T20:32:17.4717201Z scale_ub=None, 2025-05-07T20:32:17.4717410Z contiguous=False, 2025-05-07T20:32:17.4717631Z compiled=True, 2025-05-07T20:32:17.4717890Z ) 2025-05-07T20:32:17.4718196Z self = 2025-05-07T20:32:17.4718681Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4732523Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4732817Z moe/activation_test.py:126: 2025-05-07T20:32:17.4752742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4753089Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4753339Z E ^ 2025-05-07T20:32:17.4753799Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4754715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4755393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4755802Z self=, 2025-05-07T20:32:17.4756197Z T=128, 2025-05-07T20:32:17.4756375Z D=7168, 2025-05-07T20:32:17.4756551Z scale_ub=None, 2025-05-07T20:32:17.4756825Z contiguous=False, 2025-05-07T20:32:17.4757044Z compiled=False, 2025-05-07T20:32:17.4757237Z ) 2025-05-07T20:32:17.4757544Z self = 2025-05-07T20:32:17.4758023Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4780548Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4780813Z moe/activation_test.py:117: 2025-05-07T20:32:17.4794581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4794930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4795180Z E ^ 2025-05-07T20:32:17.4795661Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4796546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
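Because Hypothesis tries every sampled combination, the log repeats the identical failure for many (T, D, scale_ub, contiguous, compiled) tuples. A standalone repro sketch outside Hypothesis, using only the import path and call signature visible in the traceback and test listing above:

    # Minimal standalone repro sketch; assumes the module is importable
    # from the path shown in the traceback.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a pre-SM89 GPU this raises the same CompilationError as the log:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)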
2025-05-07T20:32:17.4797170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4797576Z self=, 2025-05-07T20:32:17.4797977Z T=4096, 2025-05-07T20:32:17.4798161Z D=5120, 2025-05-07T20:32:17.4798346Z scale_ub=1200.0, 2025-05-07T20:32:17.4798560Z contiguous=True, 2025-05-07T20:32:17.4798782Z compiled=False, 2025-05-07T20:32:17.4798980Z ) 2025-05-07T20:32:17.4799299Z self = 2025-05-07T20:32:17.4799846Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:17.4811710Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4811972Z moe/activation_test.py:117: 2025-05-07T20:32:17.4825586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4825936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4826226Z E ^ 2025-05-07T20:32:17.4826689Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4827565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4828184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4828590Z self=, 2025-05-07T20:32:17.4828981Z T=1, 2025-05-07T20:32:17.4829153Z D=5120, 2025-05-07T20:32:17.4829335Z scale_ub=None, 2025-05-07T20:32:17.4829543Z contiguous=True, 2025-05-07T20:32:17.4829760Z compiled=True, 2025-05-07T20:32:17.4830015Z ) 2025-05-07T20:32:17.4830326Z self = 2025-05-07T20:32:17.4830811Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4838474Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4838580Z moe/activation_test.py:126: 2025-05-07T20:32:17.4847251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4847391Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4847472Z E ^ 2025-05-07T20:32:17.4847824Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4848242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
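The dtype names in the ValueError are Triton's, not PyTorch's. Roughly: fp8e4nv corresponds to torch.float8_e4m3fn (NVIDIA's e4m3 without inf), fp8e5 to torch.float8_e5m2, and fp8e4b15 (e4m3 with exponent bias 15) has no direct torch dtype. A quick sketch inspecting the two formats PyTorch exposes, assuming a float8-capable PyTorch build:

    # Inspect the fp8 formats PyTorch exposes; assumes a PyTorch build
    # with float8 dtypes (torch >= 2.1).
    import torch

    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        info = torch.finfo(dtype)
        print(f"{dtype}: max={info.max}, smallest_normal={info.tiny}")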
2025-05-07T20:32:17.4848349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4848566Z self=, 2025-05-07T20:32:17.4848643Z T=2048, 2025-05-07T20:32:17.4848720Z D=5120, 2025-05-07T20:32:17.4848797Z scale_ub=None, 2025-05-07T20:32:17.4848882Z contiguous=True, 2025-05-07T20:32:17.4848957Z compiled=True, 2025-05-07T20:32:17.4849025Z ) 2025-05-07T20:32:17.4849239Z self = 2025-05-07T20:32:17.4849408Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4854616Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4854719Z moe/activation_test.py:126: 2025-05-07T20:32:17.4863429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4863526Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4863603Z E ^ 2025-05-07T20:32:17.4863959Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4864374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4864480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4864702Z self=, 2025-05-07T20:32:17.4864779Z T=128, 2025-05-07T20:32:17.4864855Z D=5120, 2025-05-07T20:32:17.4864931Z scale_ub=None, 2025-05-07T20:32:17.4865009Z contiguous=True, 2025-05-07T20:32:17.4865086Z compiled=True, 2025-05-07T20:32:17.4865153Z ) 2025-05-07T20:32:17.4865371Z self = 2025-05-07T20:32:17.4865536Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4870828Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4870930Z moe/activation_test.py:126: 2025-05-07T20:32:17.4876196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4876526Z self = 2025-05-07T20:32:17.4877309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0,
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4877811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c87c940>} 2025-05-07T20:32:17.4878674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4878865Z context = 2025-05-07T20:32:17.4878869Z 2025-05-07T20:32:17.4879029Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4879292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4879437Z module_map=module_map) 2025-05-07T20:32:17.4879597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4879700Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4879774Z E ^ 2025-05-07T20:32:17.4880125Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4880135Z 2025-05-07T20:32:17.4880543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4880547Z 2025-05-07T20:32:17.4880652Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4880875Z self=, 2025-05-07T20:32:17.4880949Z T=4096, 2025-05-07T20:32:17.4881023Z D=5120, 2025-05-07T20:32:17.4881106Z scale_ub=None, 2025-05-07T20:32:17.4881186Z contiguous=True, 2025-05-07T20:32:17.4881266Z compiled=True, 2025-05-07T20:32:17.4881336Z ) 2025-05-07T20:32:17.4881548Z self = 2025-05-07T20:32:17.4881713Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4881717Z 2025-05-07T20:32:17.4881786Z @given( 2025-05-07T20:32:17.4881903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4882003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4882111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4882225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4882346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4882417Z ) 2025-05-07T20:32:17.4882655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4882750Z def test_silu_mul_quant( 2025-05-07T20:32:17.4882823Z self, 2025-05-07T20:32:17.4882899Z T: int, 2025-05-07T20:32:17.4882971Z D: int, 2025-05-07T20:32:17.4883065Z scale_ub: Optional[float], 2025-05-07T20:32:17.4883154Z contiguous: bool, 2025-05-07T20:32:17.4883233Z compiled: bool, 2025-05-07T20:32:17.4883306Z ) -> None: 2025-05-07T20:32:17.4883397Z torch.manual_seed(2025) 2025-05-07T20:32:17.4883462Z 2025-05-07T20:32:17.4883625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4883696Z 2025-05-07T20:32:17.4883784Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4883901Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4883997Z x = x_sign * x_clamp 2025-05-07T20:32:17.4884073Z x0 = x[:, :D] 2025-05-07T20:32:17.4884153Z x1 = x[:, D:] 2025-05-07T20:32:17.4884219Z 2025-05-07T20:32:17.4884298Z if contiguous: 2025-05-07T20:32:17.4884386Z x0 = x0.contiguous() 2025-05-07T20:32:17.4884473Z x1 = x1.contiguous() 2025-05-07T20:32:17.4884544Z 2025-05-07T20:32:17.4884636Z if scale_ub is not None: 2025-05-07T20:32:17.4884734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4884863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4884944Z ) 2025-05-07T20:32:17.4885014Z else: 2025-05-07T20:32:17.4885102Z scale_ub_tensor 
= None 2025-05-07T20:32:17.4885233Z 2025-05-07T20:32:17.4885359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4885444Z op = silu_mul_quant 2025-05-07T20:32:17.4885531Z if compiled: 2025-05-07T20:32:17.4885731Z op = torch.compile(op) 2025-05-07T20:32:17.4885835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4885903Z 2025-05-07T20:32:17.4885987Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4886111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4886180Z 2025-05-07T20:32:17.4886349Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4886449Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4886543Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4886660Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4886802Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4886875Z 2025-05-07T20:32:17.4886976Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4886981Z 2025-05-07T20:32:17.4887074Z moe/activation_test.py:126: 2025-05-07T20:32:17.4887202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4887304Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4887433Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4887996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4888098Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4888455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4888679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4889041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4889296Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4889696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4889947Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4890318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4890485Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4890825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4890900Z fn() 2025-05-07T20:32:17.4891294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4891374Z self.fn.run( 2025-05-07T20:32:17.4891709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4891796Z kernel = self.compile( 2025-05-07T20:32:17.4892180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4892352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4892472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4892480Z 2025-05-07T20:32:17.4892685Z self = 2025-05-07T20:32:17.4893465Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4894022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c5e9700>} 2025-05-07T20:32:17.4894869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4895061Z context = 2025-05-07T20:32:17.4895070Z 2025-05-07T20:32:17.4895230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4895526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4895632Z module_map=module_map) 2025-05-07T20:32:17.4895787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4895881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4895962Z E ^ 2025-05-07T20:32:17.4896316Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4896320Z 2025-05-07T20:32:17.4896738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4896742Z 2025-05-07T20:32:17.4896842Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4897059Z self=, 2025-05-07T20:32:17.4897134Z T=16384, 2025-05-07T20:32:17.4897208Z D=5120, 2025-05-07T20:32:17.4897285Z scale_ub=None, 2025-05-07T20:32:17.4897366Z contiguous=True, 2025-05-07T20:32:17.4897445Z compiled=True, 2025-05-07T20:32:17.4897511Z ) 2025-05-07T20:32:17.4897727Z self = 2025-05-07T20:32:17.4897897Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:17.4897905Z 2025-05-07T20:32:17.4897977Z @given( 2025-05-07T20:32:17.4898092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4898183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4898300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4898413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4898523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4898597Z ) 2025-05-07T20:32:17.4898837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4898928Z def test_silu_mul_quant( 2025-05-07T20:32:17.4899001Z self, 2025-05-07T20:32:17.4899074Z T: int, 2025-05-07T20:32:17.4899151Z D: int, 2025-05-07T20:32:17.4899243Z scale_ub: Optional[float], 2025-05-07T20:32:17.4899327Z contiguous: bool, 2025-05-07T20:32:17.4899408Z compiled: bool, 2025-05-07T20:32:17.4899482Z ) -> None: 2025-05-07T20:32:17.4899571Z torch.manual_seed(2025) 2025-05-07T20:32:17.4899644Z 2025-05-07T20:32:17.4899805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4899875Z 2025-05-07T20:32:17.4899964Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4900088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4900168Z x = x_sign * x_clamp 2025-05-07T20:32:17.4900244Z x0 = x[:, :D] 2025-05-07T20:32:17.4900322Z x1 = x[:, D:] 2025-05-07T20:32:17.4900390Z 2025-05-07T20:32:17.4900472Z if contiguous: 2025-05-07T20:32:17.4900560Z x0 = x0.contiguous() 2025-05-07T20:32:17.4900649Z x1 = x1.contiguous() 2025-05-07T20:32:17.4900716Z 2025-05-07T20:32:17.4900800Z if scale_ub is not None: 2025-05-07T20:32:17.4900904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4901034Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:17.4901151Z ) 2025-05-07T20:32:17.4901228Z else: 2025-05-07T20:32:17.4901319Z scale_ub_tensor = None 2025-05-07T20:32:17.4901389Z 2025-05-07T20:32:17.4901517Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4905924Z op = silu_mul_quant 2025-05-07T20:32:17.4906034Z if compiled: 2025-05-07T20:32:17.4906136Z op = torch.compile(op) 2025-05-07T20:32:17.4906249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4906319Z 2025-05-07T20:32:17.4906407Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4906600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4906670Z 2025-05-07T20:32:17.4906810Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4906915Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4907015Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4907141Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4907284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4907354Z 2025-05-07T20:32:17.4907452Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:17.4907457Z 2025-05-07T20:32:17.4907562Z moe/activation_test.py:126: 2025-05-07T20:32:17.4907689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4907801Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4907933Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4908509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4908611Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4908972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4909197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4909569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4909976Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4910382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4910634Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4911014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4911181Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4911524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4911603Z fn() 2025-05-07T20:32:17.4912005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4912089Z self.fn.run( 2025-05-07T20:32:17.4912426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4912522Z kernel = self.compile( 2025-05-07T20:32:17.4912909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4913081Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4913208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:17.4913213Z 2025-05-07T20:32:17.4913421Z self = 2025-05-07T20:32:17.4914214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4914799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c052d30>} 2025-05-07T20:32:17.4915626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4915824Z context = 2025-05-07T20:32:17.4915868Z 2025-05-07T20:32:17.4916033Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4916298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4916409Z module_map=module_map) 2025-05-07T20:32:17.4916569Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4916677Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4916754Z E ^ 2025-05-07T20:32:17.4917112Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4917122Z 2025-05-07T20:32:17.4917540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4917544Z 2025-05-07T20:32:17.4917644Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4917864Z self=, 2025-05-07T20:32:17.4917943Z T=1, 2025-05-07T20:32:17.4918018Z D=5120, 2025-05-07T20:32:17.4918100Z scale_ub=1200.0, 2025-05-07T20:32:17.4918184Z contiguous=True, 2025-05-07T20:32:17.4918265Z compiled=True, 2025-05-07T20:32:17.4918341Z ) 2025-05-07T20:32:17.4918557Z self = 2025-05-07T20:32:17.4918721Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.4918726Z 2025-05-07T20:32:17.4918805Z @given( 2025-05-07T20:32:17.4918922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4919023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4919142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4919256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4919370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4919440Z ) 2025-05-07T20:32:17.4919686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4919783Z def test_silu_mul_quant( 2025-05-07T20:32:17.4919855Z self, 2025-05-07T20:32:17.4919928Z T: int, 2025-05-07T20:32:17.4920004Z D: int, 2025-05-07T20:32:17.4920101Z scale_ub: Optional[float], 2025-05-07T20:32:17.4920187Z contiguous: bool, 2025-05-07T20:32:17.4920279Z compiled: bool, 2025-05-07T20:32:17.4920356Z ) -> None: 2025-05-07T20:32:17.4920448Z torch.manual_seed(2025) 2025-05-07T20:32:17.4920519Z 2025-05-07T20:32:17.4920689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4920759Z 2025-05-07T20:32:17.4920853Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4920972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4921059Z x = x_sign * x_clamp 2025-05-07T20:32:17.4921136Z x0 = x[:, :D] 2025-05-07T20:32:17.4921213Z x1 = x[:, D:] 2025-05-07T20:32:17.4921290Z 2025-05-07T20:32:17.4921369Z if contiguous: 2025-05-07T20:32:17.4921458Z x0 = x0.contiguous() 2025-05-07T20:32:17.4921548Z x1 = x1.contiguous() 2025-05-07T20:32:17.4921616Z 2025-05-07T20:32:17.4921705Z if scale_ub is not None: 2025-05-07T20:32:17.4921812Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:17.4921995Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4922069Z ) 2025-05-07T20:32:17.4922149Z else: 2025-05-07T20:32:17.4922241Z scale_ub_tensor = None 2025-05-07T20:32:17.4922312Z 2025-05-07T20:32:17.4922518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4922609Z op = silu_mul_quant 2025-05-07T20:32:17.4922696Z if compiled: 2025-05-07T20:32:17.4922792Z op = torch.compile(op) 2025-05-07T20:32:17.4922895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4923006Z 2025-05-07T20:32:17.4923093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4923097Z 2025-05-07T20:32:17.4923192Z moe/activation_test.py:117: 2025-05-07T20:32:17.4923320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4923417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4923519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4923889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.4923976Z return fn(*args, **kwargs) 2025-05-07T20:32:17.4924480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4924576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4924930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4925152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4925490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4925583Z kernel = self.compile( 2025-05-07T20:32:17.4925957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4926132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4926261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4926266Z 2025-05-07T20:32:17.4926474Z self = 2025-05-07T20:32:17.4927261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4927772Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bb5ec10>} 2025-05-07T20:32:17.4928524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4928716Z context = 2025-05-07T20:32:17.4928721Z 2025-05-07T20:32:17.4928881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4929148Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4929253Z module_map=module_map) 2025-05-07T20:32:17.4929414Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4929510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4929587Z E ^ 2025-05-07T20:32:17.4929942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4929946Z 2025-05-07T20:32:17.4930357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4930407Z 2025-05-07T20:32:17.4930507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4930729Z self=, 2025-05-07T20:32:17.4930803Z T=1, 2025-05-07T20:32:17.4930881Z D=5120, 2025-05-07T20:32:17.4931057Z scale_ub=None, 2025-05-07T20:32:17.4931142Z contiguous=False, 2025-05-07T20:32:17.4931228Z compiled=True, 2025-05-07T20:32:17.4931297Z ) 2025-05-07T20:32:17.4931511Z self = 2025-05-07T20:32:17.4931677Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4931719Z 2025-05-07T20:32:17.4931795Z @given( 2025-05-07T20:32:17.4931911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4932010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4932121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4932241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4932354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4932426Z ) 2025-05-07T20:32:17.4932673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4932766Z def test_silu_mul_quant( 2025-05-07T20:32:17.4932844Z self, 2025-05-07T20:32:17.4932920Z T: int, 2025-05-07T20:32:17.4932993Z D: int, 2025-05-07T20:32:17.4933088Z scale_ub: Optional[float], 2025-05-07T20:32:17.4933177Z contiguous: bool, 2025-05-07T20:32:17.4933259Z compiled: bool, 2025-05-07T20:32:17.4933337Z ) -> None: 2025-05-07T20:32:17.4933436Z torch.manual_seed(2025) 2025-05-07T20:32:17.4933507Z 2025-05-07T20:32:17.4933675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4933745Z 2025-05-07T20:32:17.4933832Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4933956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4934045Z x = x_sign * x_clamp 2025-05-07T20:32:17.4934121Z x0 = x[:, :D] 2025-05-07T20:32:17.4934202Z x1 = x[:, D:] 2025-05-07T20:32:17.4934270Z 2025-05-07T20:32:17.4934350Z if contiguous: 2025-05-07T20:32:17.4934447Z x0 = x0.contiguous() 2025-05-07T20:32:17.4934537Z x1 = x1.contiguous() 2025-05-07T20:32:17.4934608Z 2025-05-07T20:32:17.4934698Z if scale_ub is not None: 2025-05-07T20:32:17.4934800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4934935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4935011Z ) 2025-05-07T20:32:17.4935086Z else: 2025-05-07T20:32:17.4935179Z scale_ub_tensor = None 2025-05-07T20:32:17.4935749Z 2025-05-07T20:32:17.4935877Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4935970Z op = silu_mul_quant 2025-05-07T20:32:17.4936050Z if compiled: 2025-05-07T20:32:17.4936148Z op = torch.compile(op) 2025-05-07T20:32:17.4936255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4936325Z 2025-05-07T20:32:17.4936411Z y_fp8, y_scale = fn() 2025-05-07T20:32:17.4936531Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:17.4936605Z 2025-05-07T20:32:17.4936742Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4936842Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:17.4936938Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:17.4937060Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:17.4937200Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4937272Z 2025-05-07T20:32:17.4937371Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:17.4937376Z 2025-05-07T20:32:17.4937471Z moe/activation_test.py:126: 2025-05-07T20:32:17.4937598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4937750Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:17.4937882Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:17.4938514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:17.4938612Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:17.4938969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4939197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4939597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:17.4939853Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4940245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:17.4940499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:17.4940874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:17.4941042Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:17.4941381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:17.4941458Z fn() 2025-05-07T20:32:17.4941854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:17.4941938Z self.fn.run( 2025-05-07T20:32:17.4942267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4942357Z kernel = self.compile( 2025-05-07T20:32:17.4942740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4942914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4943046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4943051Z 2025-05-07T20:32:17.4943253Z self = 2025-05-07T20:32:17.4944037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4944553Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7feb2c5f81f0>} 2025-05-07T20:32:17.4945297Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4945492Z context = 2025-05-07T20:32:17.4945497Z 2025-05-07T20:32:17.4945661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4945920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4946028Z module_map=module_map) 2025-05-07T20:32:17.4946186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4946287Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:17.4946359Z E ^ 2025-05-07T20:32:17.4946711Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4946715Z 2025-05-07T20:32:17.4947130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4947178Z 2025-05-07T20:32:17.4947279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4947498Z self=, 2025-05-07T20:32:17.4947646Z T=1, 2025-05-07T20:32:17.4947721Z D=5120, 2025-05-07T20:32:17.4947802Z scale_ub=None, 2025-05-07T20:32:17.4947887Z contiguous=True, 2025-05-07T20:32:17.4947968Z compiled=False, 2025-05-07T20:32:17.4948041Z ) 2025-05-07T20:32:17.4948254Z self = 2025-05-07T20:32:17.4948452Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.4948456Z 2025-05-07T20:32:17.4948533Z @given( 2025-05-07T20:32:17.4948647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4948745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4948857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4948976Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4949089Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4949159Z ) 2025-05-07T20:32:17.4949406Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4949501Z def test_silu_mul_quant( 2025-05-07T20:32:17.4949577Z self, 2025-05-07T20:32:17.4949650Z T: int, 2025-05-07T20:32:17.4949730Z D: int, 2025-05-07T20:32:17.4949887Z scale_ub: Optional[float], 2025-05-07T20:32:17.4949974Z contiguous: bool, 2025-05-07T20:32:17.4950063Z compiled: bool, 2025-05-07T20:32:17.4950138Z ) -> None: 2025-05-07T20:32:17.4950232Z torch.manual_seed(2025) 2025-05-07T20:32:17.4950300Z 2025-05-07T20:32:17.4950463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4950536Z 2025-05-07T20:32:17.4950623Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4950747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4950835Z x = x_sign * x_clamp 2025-05-07T20:32:17.4950913Z x0 = x[:, :D] 2025-05-07T20:32:17.4950990Z x1 = x[:, D:] 2025-05-07T20:32:17.4951062Z 2025-05-07T20:32:17.4951146Z if contiguous: 2025-05-07T20:32:17.4951234Z x0 = x0.contiguous() 2025-05-07T20:32:17.4951322Z x1 = x1.contiguous() 2025-05-07T20:32:17.4951392Z 2025-05-07T20:32:17.4951480Z if scale_ub is not None: 2025-05-07T20:32:17.4951585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4951720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4951795Z ) 2025-05-07T20:32:17.4951869Z else: 2025-05-07T20:32:17.4951962Z scale_ub_tensor = None 2025-05-07T20:32:17.4952036Z 2025-05-07T20:32:17.4952164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4952249Z op = silu_mul_quant 2025-05-07T20:32:17.4952336Z if compiled: 2025-05-07T20:32:17.4952432Z op 
= torch.compile(op) 2025-05-07T20:32:17.4952534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4952606Z 2025-05-07T20:32:17.4952697Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4952702Z 2025-05-07T20:32:17.4952798Z moe/activation_test.py:117: 2025-05-07T20:32:17.4952923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4953021Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4953120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4953622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4953719Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4954076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4954346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4954682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4954772Z kernel = self.compile( 2025-05-07T20:32:17.4955225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4955401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4955522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4955564Z 2025-05-07T20:32:17.4955768Z self = 2025-05-07T20:32:17.4956551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4957058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bb5eb80>} 2025-05-07T20:32:17.4957820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4958010Z context = 2025-05-07T20:32:17.4958015Z 2025-05-07T20:32:17.4958183Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4958441Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4958545Z module_map=module_map) 2025-05-07T20:32:17.4958710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4958809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4958885Z E ^ 2025-05-07T20:32:17.4959242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4959246Z 2025-05-07T20:32:17.4959661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4959666Z 2025-05-07T20:32:17.4959767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4959986Z self=, 2025-05-07T20:32:17.4960064Z T=128, 2025-05-07T20:32:17.4960139Z D=5120, 2025-05-07T20:32:17.4960218Z scale_ub=None, 2025-05-07T20:32:17.4960299Z contiguous=False, 2025-05-07T20:32:17.4960382Z compiled=True, 2025-05-07T20:32:17.4960451Z ) 2025-05-07T20:32:17.4960669Z self = 2025-05-07T20:32:17.4960839Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.4960846Z 2025-05-07T20:32:17.4960922Z @given( 2025-05-07T20:32:17.4961040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4961135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4961250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4961368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4961478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4961548Z ) 2025-05-07T20:32:17.4961794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4961888Z def test_silu_mul_quant( 2025-05-07T20:32:17.4961964Z self, 2025-05-07T20:32:17.4962038Z T: int, 2025-05-07T20:32:17.4962112Z D: int, 2025-05-07T20:32:17.4962208Z scale_ub: Optional[float], 2025-05-07T20:32:17.4962295Z contiguous: bool, 2025-05-07T20:32:17.4962376Z compiled: bool, 2025-05-07T20:32:17.4962525Z ) -> None: 2025-05-07T20:32:17.4962615Z torch.manual_seed(2025) 2025-05-07T20:32:17.4962684Z 2025-05-07T20:32:17.4962854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4962929Z 2025-05-07T20:32:17.4963090Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4963216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4963301Z x = x_sign * x_clamp 2025-05-07T20:32:17.4963383Z x0 = x[:, :D] 2025-05-07T20:32:17.4963458Z x1 = x[:, D:] 2025-05-07T20:32:17.4963526Z 2025-05-07T20:32:17.4963650Z if contiguous: 2025-05-07T20:32:17.4963737Z x0 = x0.contiguous() 2025-05-07T20:32:17.4963821Z x1 = x1.contiguous() 2025-05-07T20:32:17.4963893Z 2025-05-07T20:32:17.4963981Z if scale_ub is not None: 2025-05-07T20:32:17.4964081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4964215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4964291Z ) 2025-05-07T20:32:17.4964365Z else: 2025-05-07T20:32:17.4964460Z scale_ub_tensor = None 2025-05-07T20:32:17.4964527Z 2025-05-07T20:32:17.4964656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4964748Z op = silu_mul_quant 2025-05-07T20:32:17.4964839Z if compiled: 2025-05-07T20:32:17.4964936Z op = torch.compile(op) 2025-05-07T20:32:17.4965037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4965110Z 2025-05-07T20:32:17.4965203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4965208Z 2025-05-07T20:32:17.4965301Z moe/activation_test.py:117: 2025-05-07T20:32:17.4965426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4965525Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4965622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4965993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.4966086Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.4966588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4966683Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4967036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4967263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4967600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4967693Z kernel = self.compile( 2025-05-07T20:32:17.4968069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4968245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4968375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4968380Z 2025-05-07T20:32:17.4968584Z self = 2025-05-07T20:32:17.4969371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4969929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2c5e9a60>} 2025-05-07T20:32:17.4970684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4970925Z context = 2025-05-07T20:32:17.4970930Z 2025-05-07T20:32:17.4971092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4971488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4971594Z module_map=module_map) 2025-05-07T20:32:17.4971751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4971851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4971925Z E ^ 2025-05-07T20:32:17.4972318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4972325Z 2025-05-07T20:32:17.4972734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4972739Z 2025-05-07T20:32:17.4972839Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4973063Z self=, 2025-05-07T20:32:17.4973139Z T=128, 2025-05-07T20:32:17.4973211Z D=7168, 2025-05-07T20:32:17.4973295Z scale_ub=1200.0, 2025-05-07T20:32:17.4973382Z contiguous=False, 2025-05-07T20:32:17.4973464Z compiled=False, 2025-05-07T20:32:17.4973539Z ) 2025-05-07T20:32:17.4973752Z self = 2025-05-07T20:32:17.4973923Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.4973930Z 2025-05-07T20:32:17.4974005Z @given( 2025-05-07T20:32:17.4974120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4974220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4974331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4974443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4974555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4974628Z ) 2025-05-07T20:32:17.4974869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4974962Z def test_silu_mul_quant( 2025-05-07T20:32:17.4975036Z self, 2025-05-07T20:32:17.4975118Z T: int, 2025-05-07T20:32:17.4975191Z D: int, 2025-05-07T20:32:17.4975286Z scale_ub: Optional[float], 2025-05-07T20:32:17.4975374Z contiguous: bool, 2025-05-07T20:32:17.4975456Z compiled: bool, 2025-05-07T20:32:17.4975530Z ) -> None: 2025-05-07T20:32:17.4975623Z torch.manual_seed(2025) 2025-05-07T20:32:17.4975697Z 2025-05-07T20:32:17.4975864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4975937Z 2025-05-07T20:32:17.4976025Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4976146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4976234Z x = x_sign * x_clamp 2025-05-07T20:32:17.4976313Z x0 = x[:, :D] 2025-05-07T20:32:17.4976393Z x1 = x[:, D:] 2025-05-07T20:32:17.4976461Z 2025-05-07T20:32:17.4976540Z if contiguous: 2025-05-07T20:32:17.4976630Z x0 = x0.contiguous() 2025-05-07T20:32:17.4976719Z x1 = x1.contiguous() 2025-05-07T20:32:17.4976786Z 2025-05-07T20:32:17.4976877Z if scale_ub is not None: 2025-05-07T20:32:17.4976978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4977108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4977186Z ) 2025-05-07T20:32:17.4977263Z else: 2025-05-07T20:32:17.4977354Z scale_ub_tensor = None 2025-05-07T20:32:17.4977426Z 2025-05-07T20:32:17.4977551Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4977637Z op = silu_mul_quant 2025-05-07T20:32:17.4977719Z if compiled: 2025-05-07T20:32:17.4977816Z op = torch.compile(op) 2025-05-07T20:32:17.4977971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4978039Z 2025-05-07T20:32:17.4978125Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4978130Z 2025-05-07T20:32:17.4978228Z moe/activation_test.py:117: 2025-05-07T20:32:17.4978424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4978522Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4978621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4979117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4979254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4979606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4979827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4980165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4980259Z kernel = self.compile( 2025-05-07T20:32:17.4980637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4980816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4980940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4980944Z 2025-05-07T20:32:17.4981151Z self = 2025-05-07T20:32:17.4981934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4982439Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2b75d4c0>} 2025-05-07T20:32:17.4983198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4983390Z context = 2025-05-07T20:32:17.4983394Z 2025-05-07T20:32:17.4983558Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4983815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4983923Z module_map=module_map) 2025-05-07T20:32:17.4984090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4984185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4984266Z E ^ 2025-05-07T20:32:17.4984618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4984625Z 2025-05-07T20:32:17.4985034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.4985039Z 2025-05-07T20:32:17.4985149Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.4985374Z self=, 2025-05-07T20:32:17.4985450Z T=128, 2025-05-07T20:32:17.4985522Z D=5120, 2025-05-07T20:32:17.4985600Z scale_ub=None, 2025-05-07T20:32:17.4985687Z contiguous=False, 2025-05-07T20:32:17.4985767Z compiled=False, 2025-05-07T20:32:17.4985836Z ) 2025-05-07T20:32:17.4986052Z self = 2025-05-07T20:32:17.4986217Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.4986221Z 2025-05-07T20:32:17.4986296Z @given( 2025-05-07T20:32:17.4986459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.4986559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.4986672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.4986785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.4986973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.4987048Z ) 2025-05-07T20:32:17.4987288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.4987377Z def test_silu_mul_quant( 2025-05-07T20:32:17.4987455Z self, 2025-05-07T20:32:17.4987590Z T: int, 2025-05-07T20:32:17.4987662Z D: int, 2025-05-07T20:32:17.4987758Z scale_ub: Optional[float], 2025-05-07T20:32:17.4987844Z contiguous: bool, 2025-05-07T20:32:17.4987926Z compiled: bool, 2025-05-07T20:32:17.4988002Z ) -> None: 2025-05-07T20:32:17.4988092Z torch.manual_seed(2025) 2025-05-07T20:32:17.4988166Z 2025-05-07T20:32:17.4988330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.4988403Z 2025-05-07T20:32:17.4988496Z x_sign = torch.sign(x) 2025-05-07T20:32:17.4988615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.4988704Z x = x_sign * x_clamp 2025-05-07T20:32:17.4988782Z x0 = x[:, :D] 2025-05-07T20:32:17.4988859Z x1 = x[:, D:] 2025-05-07T20:32:17.4988929Z 2025-05-07T20:32:17.4989011Z if contiguous: 2025-05-07T20:32:17.4989098Z x0 = x0.contiguous() 2025-05-07T20:32:17.4989185Z x1 = x1.contiguous() 2025-05-07T20:32:17.4989260Z 2025-05-07T20:32:17.4989347Z if scale_ub is not None: 2025-05-07T20:32:17.4989448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.4989584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.4989657Z ) 2025-05-07T20:32:17.4989734Z else: 2025-05-07T20:32:17.4989877Z scale_ub_tensor = None 2025-05-07T20:32:17.4989946Z 2025-05-07T20:32:17.4990079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.4990165Z op = silu_mul_quant 2025-05-07T20:32:17.4990248Z if compiled: 2025-05-07T20:32:17.4990351Z op = torch.compile(op) 2025-05-07T20:32:17.4990453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4990521Z 2025-05-07T20:32:17.4990611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.4990616Z 2025-05-07T20:32:17.4990710Z moe/activation_test.py:117: 2025-05-07T20:32:17.4990839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4990937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.4991030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.4991534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.4991631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.4991987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.4992209Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.4992548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.4992640Z kernel = self.compile( 2025-05-07T20:32:17.4993017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.4993193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.4993319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.4993323Z 2025-05-07T20:32:17.4993526Z self = 2025-05-07T20:32:17.4994306Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.4994928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2bc2aee0>} 2025-05-07T20:32:17.4995673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.4995907Z context = 2025-05-07T20:32:17.4995912Z 2025-05-07T20:32:17.4996073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.4996332Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.4996439Z module_map=module_map) 2025-05-07T20:32:17.4996598Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.4996694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.4996767Z E ^ 2025-05-07T20:32:17.4997125Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.4997543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.4997650Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) [test body and traceback identical to the example above; same CompilationError -- elided]
2025-05-07T20:32:17.5010276Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) [identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80 -- elided]
2025-05-07T20:32:17.5023291Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5046478Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) [same test body; this example gets further: the silu_mul_quant launch succeeds and the failure moves to the reference path]
2025-05-07T20:32:17.5072792Z         y_fp8, y_scale = fn()
2025-05-07T20:32:17.5072917Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:17.5073218Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.5073322Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:17.5073425Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:17.5073550Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:17.5073740Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.5073914Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:17.5074023Z moe/activation_test.py:126:
2025-05-07T20:32:17.5074275Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:17.5074421Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:17.5074994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:17.5075092Z     _kernel_quantize_fp8_row[grid](
[autotuner frames elided: triton/runtime/autotuner.py:186 run -> autotuner.py:166 _bench -> triton/testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir]
2025-05-07T20:32:17.5082856Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5082994Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:17.5083068Z E       ^
2025-05-07T20:32:17.5083462Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5083952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
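The failure mode above is uniform: Triton refuses to compile any kernel that touches the fp8e4nv element type (torch.float8_e4m3fn) on this GPU, whose Triton backend only exposes 'fp8e4b15' and 'fp8e5'. fp8e4nv lowering generally requires NVIDIA compute capability 8.9 or newer, so a hardware guard would turn this wall of CompilationErrors into clean skips. A minimal sketch, assuming the SM 8.9+ requirement holds and using only standard torch/unittest APIs; supports_fp8e4nv and the class name are illustrative, not FBGEMM code:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (torch.float8_e4m3fn) kernels compile only on
    # NVIDIA GPUs with compute capability >= (8, 9), i.e. Ada/Hopper.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton build lacks fp8e4nv on this GPU")
class SiluMulQuantFP8Test(unittest.TestCase):
    # Hypothetical home for the test_silu_mul_quant case shown above.
    pass

With a guard like this, the test would never reach the kernel launch on unsupported hardware, and the Hypothesis examples above would be reported as skips instead of failing compilation one by one.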
2025-05-07T20:32:17.5084062Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) [identical CompilationError in _fbgemm_silu_mul_quant -- elided]
2025-05-07T20:32:17.5097063Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5110725Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5123927Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) [identical failure -- elided]
2025-05-07T20:32:17.5137205Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5150052Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) [identical failure -- elided]
2025-05-07T20:32:17.5162684Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) [identical failure; traceback elided up to the final error]
2025-05-07T20:32:17.5175134Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5175234Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5175316Z E       ^
2025-05-07T20:32:17.5175679Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5175688Z 2025-05-07T20:32:17.5176099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5176104Z 2025-05-07T20:32:17.5176214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5176440Z self=, 2025-05-07T20:32:17.5176515Z T=4096, 2025-05-07T20:32:17.5176597Z D=5120, 2025-05-07T20:32:17.5176678Z scale_ub=None, 2025-05-07T20:32:17.5176770Z contiguous=False, 2025-05-07T20:32:17.5176852Z compiled=True, 2025-05-07T20:32:17.5176928Z ) 2025-05-07T20:32:17.5177152Z self = 2025-05-07T20:32:17.5177323Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5177327Z 2025-05-07T20:32:17.5177404Z @given( 2025-05-07T20:32:17.5177527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5182515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5182658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5182777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5182893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5182972Z ) 2025-05-07T20:32:17.5183220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5183318Z def test_silu_mul_quant( 2025-05-07T20:32:17.5183393Z self, 2025-05-07T20:32:17.5183468Z T: int, 2025-05-07T20:32:17.5183545Z D: int, 2025-05-07T20:32:17.5183645Z scale_ub: Optional[float], 2025-05-07T20:32:17.5183737Z contiguous: bool, 2025-05-07T20:32:17.5183820Z compiled: bool, 2025-05-07T20:32:17.5183897Z ) -> None: 2025-05-07T20:32:17.5183992Z torch.manual_seed(2025) 2025-05-07T20:32:17.5184062Z 2025-05-07T20:32:17.5184230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5184410Z 2025-05-07T20:32:17.5184501Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5184625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5184716Z x = x_sign * x_clamp 2025-05-07T20:32:17.5184794Z x0 = x[:, :D] 2025-05-07T20:32:17.5184948Z x1 = x[:, D:] 2025-05-07T20:32:17.5185026Z 2025-05-07T20:32:17.5185107Z if contiguous: 2025-05-07T20:32:17.5185196Z x0 = x0.contiguous() 2025-05-07T20:32:17.5185287Z x1 = x1.contiguous() 2025-05-07T20:32:17.5185357Z 2025-05-07T20:32:17.5185448Z if scale_ub is not None: 2025-05-07T20:32:17.5185594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5185729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5185808Z ) 2025-05-07T20:32:17.5185883Z else: 2025-05-07T20:32:17.5185975Z scale_ub_tensor = None 2025-05-07T20:32:17.5186051Z 2025-05-07T20:32:17.5186182Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5186270Z op = silu_mul_quant 2025-05-07T20:32:17.5186356Z if compiled: 2025-05-07T20:32:17.5186453Z op = torch.compile(op) 2025-05-07T20:32:17.5186565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5186638Z 2025-05-07T20:32:17.5186726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5186731Z 2025-05-07T20:32:17.5186832Z moe/activation_test.py:117: 2025-05-07T20:32:17.5186959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5187062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5187165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5187536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5187625Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5188128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5188226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5188584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5188812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5189148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5189244Z kernel = self.compile( 2025-05-07T20:32:17.5189623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5189796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5190004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5190008Z 2025-05-07T20:32:17.5190215Z self = 2025-05-07T20:32:17.5191009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5191527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2aec68b0>} 2025-05-07T20:32:17.5192279Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5192478Z context = 2025-05-07T20:32:17.5192483Z 2025-05-07T20:32:17.5192647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5192962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5193068Z module_map=module_map) 2025-05-07T20:32:17.5193235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5193411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5193487Z E ^ 2025-05-07T20:32:17.5193850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5193855Z 2025-05-07T20:32:17.5194268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5194335Z 2025-05-07T20:32:17.5194440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5194661Z self=, 2025-05-07T20:32:17.5194736Z T=4096, 2025-05-07T20:32:17.5194815Z D=5120, 2025-05-07T20:32:17.5194898Z scale_ub=1200.0, 2025-05-07T20:32:17.5194984Z contiguous=False, 2025-05-07T20:32:17.5195069Z compiled=False, 2025-05-07T20:32:17.5195140Z ) 2025-05-07T20:32:17.5195355Z self = 2025-05-07T20:32:17.5195539Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5195544Z 2025-05-07T20:32:17.5195618Z @given( 2025-05-07T20:32:17.5195734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5195835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5195948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5196070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5196186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5196257Z ) 2025-05-07T20:32:17.5196504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5196594Z def test_silu_mul_quant( 2025-05-07T20:32:17.5196670Z self, 2025-05-07T20:32:17.5196746Z T: int, 2025-05-07T20:32:17.5196820Z D: int, 2025-05-07T20:32:17.5196915Z scale_ub: Optional[float], 2025-05-07T20:32:17.5197006Z contiguous: bool, 2025-05-07T20:32:17.5197094Z compiled: bool, 2025-05-07T20:32:17.5197176Z ) -> None: 2025-05-07T20:32:17.5197267Z torch.manual_seed(2025) 2025-05-07T20:32:17.5197337Z 2025-05-07T20:32:17.5197506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5197578Z 2025-05-07T20:32:17.5197669Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5197797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5197885Z x = x_sign * x_clamp 2025-05-07T20:32:17.5197964Z x0 = x[:, :D] 2025-05-07T20:32:17.5198044Z x1 = x[:, D:] 2025-05-07T20:32:17.5198114Z 2025-05-07T20:32:17.5198197Z if contiguous: 2025-05-07T20:32:17.5198290Z x0 = x0.contiguous() 2025-05-07T20:32:17.5198379Z x1 = x1.contiguous() 2025-05-07T20:32:17.5198451Z 2025-05-07T20:32:17.5198543Z if scale_ub is not None: 2025-05-07T20:32:17.5198647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5198795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5198870Z ) 2025-05-07T20:32:17.5198945Z else: 2025-05-07T20:32:17.5199041Z scale_ub_tensor = None 2025-05-07T20:32:17.5199111Z 2025-05-07T20:32:17.5199241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5199337Z op = silu_mul_quant 2025-05-07T20:32:17.5199421Z if compiled: 2025-05-07T20:32:17.5199519Z op = torch.compile(op) 2025-05-07T20:32:17.5199633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5199703Z 2025-05-07T20:32:17.5199796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5199801Z 2025-05-07T20:32:17.5199894Z moe/activation_test.py:117: 2025-05-07T20:32:17.5200070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5200173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5200271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5200861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5200960Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5201318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5201580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5201916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5202009Z kernel = self.compile( 2025-05-07T20:32:17.5202394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5202569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5202699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5202704Z 2025-05-07T20:32:17.5202914Z self = 2025-05-07T20:32:17.5203699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5204612Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2afcd040>} 2025-05-07T20:32:17.5205370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5205575Z context = 2025-05-07T20:32:17.5205580Z 2025-05-07T20:32:17.5205750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5206011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5206124Z module_map=module_map) 2025-05-07T20:32:17.5206284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5206387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5206462Z E ^ 2025-05-07T20:32:17.5206817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5206822Z 2025-05-07T20:32:17.5207238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5207246Z 2025-05-07T20:32:17.5207346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5207571Z self=, 2025-05-07T20:32:17.5207646Z T=4096, 2025-05-07T20:32:17.5207725Z D=5120, 2025-05-07T20:32:17.5207811Z scale_ub=1200.0, 2025-05-07T20:32:17.5207895Z contiguous=False, 2025-05-07T20:32:17.5207976Z compiled=True, 2025-05-07T20:32:17.5208052Z ) 2025-05-07T20:32:17.5208267Z self = 2025-05-07T20:32:17.5208439Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5208444Z 2025-05-07T20:32:17.5208526Z @given( 2025-05-07T20:32:17.5208642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5208744Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5208855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5209082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5209199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5209271Z ) 2025-05-07T20:32:17.5209515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5209726Z def test_silu_mul_quant( 2025-05-07T20:32:17.5209804Z self, 2025-05-07T20:32:17.5209881Z T: int, 2025-05-07T20:32:17.5209961Z D: int, 2025-05-07T20:32:17.5210057Z scale_ub: Optional[float], 2025-05-07T20:32:17.5210143Z contiguous: bool, 2025-05-07T20:32:17.5210293Z compiled: bool, 2025-05-07T20:32:17.5210370Z ) -> None: 2025-05-07T20:32:17.5210467Z torch.manual_seed(2025) 2025-05-07T20:32:17.5210537Z 2025-05-07T20:32:17.5210703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5210780Z 2025-05-07T20:32:17.5210872Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5210993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5211087Z x = x_sign * x_clamp 2025-05-07T20:32:17.5211165Z x0 = x[:, :D] 2025-05-07T20:32:17.5211241Z x1 = x[:, D:] 2025-05-07T20:32:17.5211316Z 2025-05-07T20:32:17.5211397Z if contiguous: 2025-05-07T20:32:17.5211489Z x0 = x0.contiguous() 2025-05-07T20:32:17.5211582Z x1 = x1.contiguous() 2025-05-07T20:32:17.5211655Z 2025-05-07T20:32:17.5211749Z if scale_ub is not None: 2025-05-07T20:32:17.5211850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5211982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5212065Z ) 2025-05-07T20:32:17.5212140Z else: 2025-05-07T20:32:17.5212231Z scale_ub_tensor = None 2025-05-07T20:32:17.5212306Z 2025-05-07T20:32:17.5212434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5212523Z op = silu_mul_quant 2025-05-07T20:32:17.5212612Z if compiled: 2025-05-07T20:32:17.5212710Z op = torch.compile(op) 2025-05-07T20:32:17.5212815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5212889Z 2025-05-07T20:32:17.5212976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5212986Z 2025-05-07T20:32:17.5213085Z moe/activation_test.py:117: 2025-05-07T20:32:17.5213212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5213309Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5213410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5213778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5213868Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5214365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5214459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5214823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5215045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5215385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5215486Z kernel = self.compile( 2025-05-07T20:32:17.5215867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5216044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5216173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5216178Z 2025-05-07T20:32:17.5216382Z self = 2025-05-07T20:32:17.5217171Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5217803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2afcdee0>} 2025-05-07T20:32:17.5218565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5218799Z context = 2025-05-07T20:32:17.5218804Z 2025-05-07T20:32:17.5218965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5219229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5219338Z module_map=module_map) 2025-05-07T20:32:17.5219504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5219601Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5219675Z E ^ 2025-05-07T20:32:17.5220045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5220050Z 2025-05-07T20:32:17.5220460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5220465Z 2025-05-07T20:32:17.5220567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5220790Z self=, 2025-05-07T20:32:17.5220865Z T=2048, 2025-05-07T20:32:17.5220943Z D=7168, 2025-05-07T20:32:17.5221024Z scale_ub=1200.0, 2025-05-07T20:32:17.5221106Z contiguous=False, 2025-05-07T20:32:17.5221191Z compiled=False, 2025-05-07T20:32:17.5221263Z ) 2025-05-07T20:32:17.5221477Z self = 2025-05-07T20:32:17.5221659Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.5221663Z 2025-05-07T20:32:17.5221738Z @given( 2025-05-07T20:32:17.5221858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5221961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5222072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5222193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5222306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5222378Z ) 2025-05-07T20:32:17.5222630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5222721Z def test_silu_mul_quant( 2025-05-07T20:32:17.5222796Z self, 2025-05-07T20:32:17.5222875Z T: int, 2025-05-07T20:32:17.5222949Z D: int, 2025-05-07T20:32:17.5223048Z scale_ub: Optional[float], 2025-05-07T20:32:17.5223137Z contiguous: bool, 2025-05-07T20:32:17.5223221Z compiled: bool, 2025-05-07T20:32:17.5223298Z ) -> None: 2025-05-07T20:32:17.5223395Z torch.manual_seed(2025) 2025-05-07T20:32:17.5223473Z 2025-05-07T20:32:17.5223645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5223716Z 2025-05-07T20:32:17.5223806Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5223932Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5224019Z x = x_sign * x_clamp 2025-05-07T20:32:17.5224098Z x0 = x[:, :D] 2025-05-07T20:32:17.5224182Z x1 = x[:, D:] 2025-05-07T20:32:17.5224251Z 2025-05-07T20:32:17.5224331Z if contiguous: 2025-05-07T20:32:17.5224425Z x0 = x0.contiguous() 2025-05-07T20:32:17.5224512Z x1 = x1.contiguous() 2025-05-07T20:32:17.5224581Z 2025-05-07T20:32:17.5224722Z if scale_ub is not None: 2025-05-07T20:32:17.5224825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5224961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5225038Z ) 2025-05-07T20:32:17.5225112Z else: 2025-05-07T20:32:17.5225308Z scale_ub_tensor = None 2025-05-07T20:32:17.5225379Z 2025-05-07T20:32:17.5225509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5225600Z op = silu_mul_quant 2025-05-07T20:32:17.5225683Z if compiled: 2025-05-07T20:32:17.5225779Z op = torch.compile(op) 2025-05-07T20:32:17.5225928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5225998Z 2025-05-07T20:32:17.5226085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5226090Z 2025-05-07T20:32:17.5226189Z moe/activation_test.py:117: 2025-05-07T20:32:17.5226314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5226418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5226516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5227017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5227123Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5227479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5227699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5228041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5228131Z kernel = self.compile( 2025-05-07T20:32:17.5228513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5228686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5228810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5228815Z 2025-05-07T20:32:17.5229022Z self = 2025-05-07T20:32:17.5229864Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5230380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ad4d550>} 2025-05-07T20:32:17.5231131Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5231328Z context = 2025-05-07T20:32:17.5231335Z 2025-05-07T20:32:17.5231497Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5231760Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5231868Z module_map=module_map) 2025-05-07T20:32:17.5232027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5232123Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5232201Z E ^ 2025-05-07T20:32:17.5232555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5232562Z 2025-05-07T20:32:17.5232980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5232984Z 2025-05-07T20:32:17.5233085Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5233353Z self=, 2025-05-07T20:32:17.5233432Z T=1, 2025-05-07T20:32:17.5233507Z D=7168, 2025-05-07T20:32:17.5233587Z scale_ub=None, 2025-05-07T20:32:17.5233674Z contiguous=True, 2025-05-07T20:32:17.5233829Z compiled=False, 2025-05-07T20:32:17.5233901Z ) 2025-05-07T20:32:17.5234127Z self = 2025-05-07T20:32:17.5234288Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5234293Z 2025-05-07T20:32:17.5234372Z @given( 2025-05-07T20:32:17.5234532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5234629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5234745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5234860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5234969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5235049Z ) 2025-05-07T20:32:17.5235292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5235390Z def test_silu_mul_quant( 2025-05-07T20:32:17.5235465Z self, 2025-05-07T20:32:17.5235542Z T: int, 2025-05-07T20:32:17.5235627Z D: int, 2025-05-07T20:32:17.5235726Z scale_ub: Optional[float], 2025-05-07T20:32:17.5235813Z contiguous: bool, 2025-05-07T20:32:17.5235899Z compiled: bool, 2025-05-07T20:32:17.5235975Z ) -> None: 2025-05-07T20:32:17.5236067Z torch.manual_seed(2025) 2025-05-07T20:32:17.5236147Z 2025-05-07T20:32:17.5236312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5236387Z 2025-05-07T20:32:17.5236482Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5236603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5236691Z x = x_sign * x_clamp 2025-05-07T20:32:17.5236773Z x0 = x[:, :D] 2025-05-07T20:32:17.5236853Z x1 = x[:, D:] 2025-05-07T20:32:17.5236926Z 2025-05-07T20:32:17.5237006Z if contiguous: 2025-05-07T20:32:17.5237094Z x0 = x0.contiguous() 2025-05-07T20:32:17.5237182Z x1 = x1.contiguous() 2025-05-07T20:32:17.5237257Z 2025-05-07T20:32:17.5237345Z if scale_ub is not None: 2025-05-07T20:32:17.5237450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5237582Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5237657Z ) 2025-05-07T20:32:17.5237739Z else: 2025-05-07T20:32:17.5237836Z scale_ub_tensor = None 2025-05-07T20:32:17.5237906Z 2025-05-07T20:32:17.5238037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5238123Z op = silu_mul_quant 2025-05-07T20:32:17.5238209Z if compiled: 2025-05-07T20:32:17.5238306Z op = torch.compile(op) 2025-05-07T20:32:17.5238409Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5238487Z 2025-05-07T20:32:17.5238574Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5238579Z 2025-05-07T20:32:17.5238673Z moe/activation_test.py:117: 2025-05-07T20:32:17.5238807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5238906Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5239002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5239506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5239603Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5239962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5240187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5240530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5240669Z kernel = self.compile( 2025-05-07T20:32:17.5241048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5241299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5241423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5241428Z 2025-05-07T20:32:17.5241638Z self = 2025-05-07T20:32:17.5242423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5242976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acad160>} 2025-05-07T20:32:17.5243725Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5243921Z context = 2025-05-07T20:32:17.5243926Z 2025-05-07T20:32:17.5244091Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5244352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5244461Z module_map=module_map) 2025-05-07T20:32:17.5244626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5244724Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5244802Z E ^ 2025-05-07T20:32:17.5245155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5245163Z 2025-05-07T20:32:17.5245573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5245583Z 2025-05-07T20:32:17.5245689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5245907Z self=, 2025-05-07T20:32:17.5245987Z T=16384, 2025-05-07T20:32:17.5246060Z D=7168, 2025-05-07T20:32:17.5246142Z scale_ub=1200.0, 2025-05-07T20:32:17.5246230Z contiguous=False, 2025-05-07T20:32:17.5246315Z compiled=True, 2025-05-07T20:32:17.5246384Z ) 2025-05-07T20:32:17.5246601Z self = 2025-05-07T20:32:17.5246774Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5246778Z 2025-05-07T20:32:17.5246856Z @given( 2025-05-07T20:32:17.5246973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5247072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5247187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5247300Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5247418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5247493Z ) 2025-05-07T20:32:17.5247741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5247830Z def test_silu_mul_quant( 2025-05-07T20:32:17.5247909Z self, 2025-05-07T20:32:17.5247984Z T: int, 2025-05-07T20:32:17.5248064Z D: int, 2025-05-07T20:32:17.5248162Z scale_ub: Optional[float], 2025-05-07T20:32:17.5248248Z contiguous: bool, 2025-05-07T20:32:17.5248336Z compiled: bool, 2025-05-07T20:32:17.5248410Z ) -> None: 2025-05-07T20:32:17.5248502Z torch.manual_seed(2025) 2025-05-07T20:32:17.5248578Z 2025-05-07T20:32:17.5248741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5248859Z 2025-05-07T20:32:17.5248951Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5249072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5249230Z x = x_sign * x_clamp 2025-05-07T20:32:17.5249314Z x0 = x[:, :D] 2025-05-07T20:32:17.5249392Z x1 = x[:, D:] 2025-05-07T20:32:17.5249462Z 2025-05-07T20:32:17.5249547Z if contiguous: 2025-05-07T20:32:17.5249636Z x0 = x0.contiguous() 2025-05-07T20:32:17.5249723Z x1 = x1.contiguous() 2025-05-07T20:32:17.5249837Z 2025-05-07T20:32:17.5249925Z if scale_ub is not None: 2025-05-07T20:32:17.5250031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5250162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5250239Z ) 2025-05-07T20:32:17.5250316Z else: 2025-05-07T20:32:17.5250406Z scale_ub_tensor = None 2025-05-07T20:32:17.5250479Z 2025-05-07T20:32:17.5250611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5250697Z op = silu_mul_quant 2025-05-07T20:32:17.5250779Z if compiled: 2025-05-07T20:32:17.5250883Z op = torch.compile(op) 2025-05-07T20:32:17.5250986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5251055Z 2025-05-07T20:32:17.5251150Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5251155Z 2025-05-07T20:32:17.5251248Z moe/activation_test.py:117: 2025-05-07T20:32:17.5251375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5251477Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5251574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5251944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5252035Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5252529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5252626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5252984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5253211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5253548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5253640Z kernel = self.compile( 2025-05-07T20:32:17.5254020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5254192Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5254322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5254329Z 2025-05-07T20:32:17.5254531Z self = 2025-05-07T20:32:17.5255318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5255832Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acaddc0>} 2025-05-07T20:32:17.5256578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5256776Z context = 2025-05-07T20:32:17.5256780Z 2025-05-07T20:32:17.5256944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5257251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5257360Z module_map=module_map) 2025-05-07T20:32:17.5257675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5257777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5257851Z E ^ 2025-05-07T20:32:17.5258208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5258213Z 2025-05-07T20:32:17.5258665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5258669Z 2025-05-07T20:32:17.5258770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5258992Z self=, 2025-05-07T20:32:17.5259067Z T=1, 2025-05-07T20:32:17.5259147Z D=7168, 2025-05-07T20:32:17.5259230Z scale_ub=None, 2025-05-07T20:32:17.5259314Z contiguous=False, 2025-05-07T20:32:17.5259396Z compiled=False, 2025-05-07T20:32:17.5259469Z ) 2025-05-07T20:32:17.5259689Z self = 2025-05-07T20:32:17.5259856Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:17.5259861Z 2025-05-07T20:32:17.5259942Z @given( 2025-05-07T20:32:17.5260069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5260181Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5260321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5260435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5260547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5260617Z ) 2025-05-07T20:32:17.5260858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5260955Z def test_silu_mul_quant( 2025-05-07T20:32:17.5261030Z self, 2025-05-07T20:32:17.5261103Z T: int, 2025-05-07T20:32:17.5261183Z D: int, 2025-05-07T20:32:17.5261278Z scale_ub: Optional[float], 2025-05-07T20:32:17.5261368Z contiguous: bool, 2025-05-07T20:32:17.5261456Z compiled: bool, 2025-05-07T20:32:17.5261529Z ) -> None: 2025-05-07T20:32:17.5261625Z torch.manual_seed(2025) 2025-05-07T20:32:17.5261695Z 2025-05-07T20:32:17.5261860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5261937Z 2025-05-07T20:32:17.5262025Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5262146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5262234Z x = x_sign * x_clamp 2025-05-07T20:32:17.5262315Z x0 = x[:, :D] 2025-05-07T20:32:17.5262391Z x1 = x[:, D:] 2025-05-07T20:32:17.5262466Z 2025-05-07T20:32:17.5262546Z if contiguous: 2025-05-07T20:32:17.5262636Z x0 = x0.contiguous() 2025-05-07T20:32:17.5262727Z x1 = x1.contiguous() 2025-05-07T20:32:17.5262796Z 2025-05-07T20:32:17.5262885Z if scale_ub is not None: 2025-05-07T20:32:17.5262990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5263125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5263203Z ) 2025-05-07T20:32:17.5263277Z else: 2025-05-07T20:32:17.5263365Z scale_ub_tensor = None 2025-05-07T20:32:17.5263438Z 2025-05-07T20:32:17.5263566Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5263658Z op = silu_mul_quant 2025-05-07T20:32:17.5263741Z if compiled: 2025-05-07T20:32:17.5263837Z op = torch.compile(op) 2025-05-07T20:32:17.5263940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5264013Z 2025-05-07T20:32:17.5264100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5264153Z 2025-05-07T20:32:17.5264249Z moe/activation_test.py:117: 2025-05-07T20:32:17.5264372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5264471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5264642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5265141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5265238Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5265598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5265860Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5266200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5266291Z kernel = self.compile( 2025-05-07T20:32:17.5266667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5266845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5266971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5266975Z 2025-05-07T20:32:17.5267179Z self = 2025-05-07T20:32:17.5267964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5268472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2ad67790>} 2025-05-07T20:32:17.5269239Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5269433Z context = 2025-05-07T20:32:17.5269437Z 2025-05-07T20:32:17.5269607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5269922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5270026Z module_map=module_map) 2025-05-07T20:32:17.5270193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5270295Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5270370Z E ^ 2025-05-07T20:32:17.5270728Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5270733Z 2025-05-07T20:32:17.5271141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5271148Z 2025-05-07T20:32:17.5271250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5271469Z self=, 2025-05-07T20:32:17.5271546Z T=2048, 2025-05-07T20:32:17.5271622Z D=7168, 2025-05-07T20:32:17.5271700Z scale_ub=None, 2025-05-07T20:32:17.5271783Z contiguous=False, 2025-05-07T20:32:17.5271864Z compiled=True, 2025-05-07T20:32:17.5271934Z ) 2025-05-07T20:32:17.5272149Z self = 2025-05-07T20:32:17.5272322Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5272327Z 2025-05-07T20:32:17.5272400Z @given( 2025-05-07T20:32:17.5272518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5272616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5272731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5272900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5273012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5273081Z ) 2025-05-07T20:32:17.5273401Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5273492Z def test_silu_mul_quant( 2025-05-07T20:32:17.5273569Z self, 2025-05-07T20:32:17.5273642Z T: int, 2025-05-07T20:32:17.5273714Z D: int, 2025-05-07T20:32:17.5273812Z scale_ub: Optional[float], 2025-05-07T20:32:17.5273898Z contiguous: bool, 2025-05-07T20:32:17.5274023Z compiled: bool, 2025-05-07T20:32:17.5274098Z ) -> None: 2025-05-07T20:32:17.5274192Z torch.manual_seed(2025) 2025-05-07T20:32:17.5274260Z 2025-05-07T20:32:17.5274429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5274499Z 2025-05-07T20:32:17.5274586Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5274711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5274795Z x = x_sign * x_clamp 2025-05-07T20:32:17.5274875Z x0 = x[:, :D] 2025-05-07T20:32:17.5274951Z x1 = x[:, D:] 2025-05-07T20:32:17.5275019Z 2025-05-07T20:32:17.5275108Z if contiguous: 2025-05-07T20:32:17.5275197Z x0 = x0.contiguous() 2025-05-07T20:32:17.5275283Z x1 = x1.contiguous() 2025-05-07T20:32:17.5275355Z 2025-05-07T20:32:17.5275442Z if scale_ub is not None: 2025-05-07T20:32:17.5275544Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5275683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5275756Z ) 2025-05-07T20:32:17.5275829Z else: 2025-05-07T20:32:17.5275923Z scale_ub_tensor = None 2025-05-07T20:32:17.5275993Z 2025-05-07T20:32:17.5276119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5276215Z op = silu_mul_quant 2025-05-07T20:32:17.5276295Z if compiled: 2025-05-07T20:32:17.5276393Z op = torch.compile(op) 2025-05-07T20:32:17.5276496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5276564Z 2025-05-07T20:32:17.5276662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5276666Z 2025-05-07T20:32:17.5276760Z moe/activation_test.py:117: 2025-05-07T20:32:17.5276882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5276989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5277086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5277459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5277549Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5278041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5278142Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5278496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5278726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5279061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5279150Z kernel = self.compile( 2025-05-07T20:32:17.5279532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5279707Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5279828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5279833Z 2025-05-07T20:32:17.5280038Z self = 2025-05-07T20:32:17.5280816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5281447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7feb2acf8430>} 2025-05-07T20:32:17.5282202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5282434Z context = 2025-05-07T20:32:17.5282442Z 2025-05-07T20:32:17.5282603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5282863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5282974Z module_map=module_map) 2025-05-07T20:32:17.5283133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5283229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5283306Z E ^ 2025-05-07T20:32:17.5283665Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5283670Z 2025-05-07T20:32:17.5284080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5284088Z 2025-05-07T20:32:17.5284188Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5284406Z self=, 2025-05-07T20:32:17.5284481Z T=4096, 2025-05-07T20:32:17.5284553Z D=7168, 2025-05-07T20:32:17.5284630Z scale_ub=None, 2025-05-07T20:32:17.5284715Z contiguous=False, 2025-05-07T20:32:17.5284799Z compiled=True, 2025-05-07T20:32:17.5284868Z ) 2025-05-07T20:32:17.5285090Z self = 2025-05-07T20:32:17.5285263Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.5285268Z 2025-05-07T20:32:17.5285358Z @given( 2025-05-07T20:32:17.5285476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5285576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5285702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5285819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5285934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5286015Z ) 2025-05-07T20:32:17.5286258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5286350Z def test_silu_mul_quant( 2025-05-07T20:32:17.5286433Z self, 2025-05-07T20:32:17.5286508Z T: int, 2025-05-07T20:32:17.5286593Z D: int, 2025-05-07T20:32:17.5286690Z scale_ub: Optional[float], 2025-05-07T20:32:17.5286777Z contiguous: bool, 2025-05-07T20:32:17.5286867Z compiled: bool, 2025-05-07T20:32:17.5286944Z ) -> None: 2025-05-07T20:32:17.5287041Z torch.manual_seed(2025) 2025-05-07T20:32:17.5287121Z 2025-05-07T20:32:17.5287288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5287362Z 2025-05-07T20:32:17.5287462Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5287585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5287676Z x = x_sign * x_clamp 2025-05-07T20:32:17.5287763Z x0 = x[:, :D] 2025-05-07T20:32:17.5287844Z x1 = x[:, D:] 2025-05-07T20:32:17.5287924Z 2025-05-07T20:32:17.5288007Z if contiguous: 2025-05-07T20:32:17.5288097Z x0 = x0.contiguous() 2025-05-07T20:32:17.5288193Z x1 = x1.contiguous() 2025-05-07T20:32:17.5288337Z 2025-05-07T20:32:17.5288426Z if scale_ub is not None: 2025-05-07T20:32:17.5288536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5288673Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5288749Z ) 2025-05-07T20:32:17.5289445Z else: 2025-05-07T20:32:17.5289543Z scale_ub_tensor = None 2025-05-07T20:32:17.5289614Z 2025-05-07T20:32:17.5289749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5289840Z op = silu_mul_quant 2025-05-07T20:32:17.5289924Z if compiled: 2025-05-07T20:32:17.5290071Z op = torch.compile(op) 2025-05-07T20:32:17.5290178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5290256Z 2025-05-07T20:32:17.5290343Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5290348Z 2025-05-07T20:32:17.5290441Z moe/activation_test.py:117: 2025-05-07T20:32:17.5290573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5290676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5290775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5291153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5291244Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5291742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.5291838Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.5292199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.5292430Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.5292766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.5292858Z     kernel = self.compile(
2025-05-07T20:32:17.5293243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.5293418Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.5293553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5294549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.5296191Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.5296457Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.5296563Z                           module_map=module_map)
2025-05-07T20:32:17.5296731Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5296831Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5296907Z E       ^
2025-05-07T20:32:17.5297262Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5297728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.5297835Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.5299059Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:17.5299147Z     @given(
2025-05-07T20:32:17.5299263Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.5299368Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.5299486Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.5299603Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.5299720Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.5299793Z     )
2025-05-07T20:32:17.5300042Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.5300142Z     def test_silu_mul_quant(
2025-05-07T20:32:17.5300218Z         self,
2025-05-07T20:32:17.5300294Z         T: int,
2025-05-07T20:32:17.5300375Z         D: int,
2025-05-07T20:32:17.5300472Z         scale_ub: Optional[float],
2025-05-07T20:32:17.5300567Z         contiguous: bool,
2025-05-07T20:32:17.5300652Z         compiled: bool,
2025-05-07T20:32:17.5300730Z     ) -> None:
2025-05-07T20:32:17.5300828Z         torch.manual_seed(2025)
2025-05-07T20:32:17.5304826Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.5305006Z         x_sign = torch.sign(x)
2025-05-07T20:32:17.5305132Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.5305216Z         x = x_sign * x_clamp
2025-05-07T20:32:17.5305296Z         x0 = x[:, :D]
2025-05-07T20:32:17.5305378Z         x1 = x[:, D:]
2025-05-07T20:32:17.5305528Z         if contiguous:
2025-05-07T20:32:17.5305614Z             x0 = x0.contiguous()
2025-05-07T20:32:17.5305696Z             x1 = x1.contiguous()
2025-05-07T20:32:17.5305851Z         if scale_ub is not None:
2025-05-07T20:32:17.5305956Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.5306090Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.5306160Z             )
2025-05-07T20:32:17.5306231Z         else:
2025-05-07T20:32:17.5306324Z             scale_ub_tensor = None
2025-05-07T20:32:17.5306520Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.5306605Z             op = silu_mul_quant
2025-05-07T20:32:17.5306686Z             if compiled:
2025-05-07T20:32:17.5306785Z                 op = torch.compile(op)
2025-05-07T20:32:17.5306886Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.5307044Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:17.5307145Z moe/activation_test.py:117:
2025-05-07T20:32:17.5307270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5307368Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.5307466Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.5307974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.5308068Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.5308424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.5308768Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.5309213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.5309303Z     kernel = self.compile(
2025-05-07T20:32:17.5309680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.5309916Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.5310121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.5313323Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.5313632Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.5313742Z                           module_map=module_map)
2025-05-07T20:32:17.5313917Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.5314022Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.5314097Z E       ^
2025-05-07T20:32:17.5314523Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.5315036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
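Every failure in this run has the same root cause: the kernel asks Triton for a conversion to fp8e4nv (Triton's name for FP8 E4M3), but the error lists only 'fp8e4b15' and 'fp8e5' as supported, which is what Triton reports on NVIDIA GPUs older than compute capability 8.9 (Ada) / 9.0 (Hopper). A minimal guard along the following lines would skip rather than fail the test on such hardware; this is a sketch, and the helper name fp8_supported and the skip wiring are illustrative, not FBGEMM's actual code:

    import unittest
    import torch

    def fp8_supported() -> bool:
        """True if this GPU can run Triton fp8e4nv (FP8 E4M3) kernels."""
        if not torch.cuda.is_available():
            return False
        # Ada (8.9) and Hopper (9.0) introduced hardware FP8 E4M3 support;
        # older parts only expose fp8e4b15 / fp8e5 in Triton.
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the test shown in this log:
    # @unittest.skipUnless(fp8_supported(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...): ...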
Hypothesis then retried the test; every example below reached the same _fbgemm_silu_mul_quant kernel launch and failed with the identical CompilationError (ValueError: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). For the compiled=True examples the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching the kernel; the test source and traceback are otherwise identical to the one above:

2025-05-07T20:32:17.5315145Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5328251Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.5340873Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:17.5353712Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:17.5366464Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5379015Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:17.5391958Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:17.5404762Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:17.5417158Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.5432668Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
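From the test body repeated above, silu_mul_quant evidently fuses a SiLU-gated multiply with FP8 quantization: it takes two bfloat16 halves plus an optional scale_ub tensor and returns a quantized tensor and its scale. A rough eager-mode sketch of those assumed semantics follows; the function name silu_mul_quant_reference and the row-wise scaling scheme are guesses for illustration, since the real op is the fused Triton kernel _fbgemm_silu_mul_quant and may differ in detail:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gated multiply in float32, then quantize to FP8 E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row absolute maximum determines the dequantization scale.
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            # Optionally clamp the scale from above, as scale_ub_tensor
            # does in the test.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale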
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test source and Triton traceback identical to the example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[test source and Triton traceback identical to the example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
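The "Trying example" blocks are Hypothesis's Verbosity.verbose output: each block is one draw from the sampled_from strategies in the @given decorator, and the test body re-runs per draw (max_examples, here the test's own _MAX_SAMPLES constant, caps the number of draws). A self-contained sketch of the same harness shape, with a toy property in place of the CUDA test:

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Toy property; the real test allocates [T, 2 * D] CUDA tensors.
        assert T * 2 * D > 0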
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test source identical to the example above, failing earlier, in the input setup]
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
[test source identical to the example above]
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
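The allocator hint in these messages refers to a knob of PyTorch's caching allocator; it is read when CUDA is first initialized, so it must be set in the environment before that point. A minimal sketch of applying the suggestion (whether it helps depends on whether the OOMs are fragmentation or plain exhaustion):

    import os

    # Must be set before torch initializes CUDA, ideally before the import.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.empty(1024, device="cuda")  # allocations now use expandable segments

In CI this is usually exported in the workflow environment rather than set in Python.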
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch; 141.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch; 85.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch; 85.02 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:94: OutOfMemoryError
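Note that the last two failures die on elementwise temporaries (torch.sign, torch.clamp) of only 56 MiB, not on the initial torch.randn. As a sketch, the same preprocessing can be written with one temporary instead of three by mutating x in place; this is an illustrative rewrite, not the test's code:

    import torch

    def preprocess_inplace(x: torch.Tensor) -> torch.Tensor:
        # sign(x) * clamp(|x|, 0.01, 2.0), as in the test, but in place:
        # only `sign` is a fresh allocation; x itself is overwritten.
        sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(sign)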
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
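This example hits the same CompilationError even though compiled=False: the traceback goes from activation.py:80 straight into triton/runtime/jit.py, because subscripting a @triton.jit function with a grid and calling it is what triggers Triton's own JIT compile, independently of torch.compile. A toy illustration of that launch pattern (not the FBGEMM kernel):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _toy_copy(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.arange(8, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x)
    # kernel[grid](...) compiles on first use -- the point where the
    # fp8e4nv ValueError above is raised for _fbgemm_silu_mul_quant.
    _toy_copy[(1,)](x, y, x.numel(), BLOCK=8)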
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch; 59.18 MiB reserved but unallocated). [allocator advice identical to the messages above]

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[test source and Triton traceback identical to the first example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5558548Z 2025-05-07T20:32:17.5558956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5558961Z 2025-05-07T20:32:17.5559110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5559334Z self=, 2025-05-07T20:32:17.5559408Z T=2048, 2025-05-07T20:32:17.5559487Z D=5120, 2025-05-07T20:32:17.5559564Z scale_ub=None, 2025-05-07T20:32:17.5559645Z contiguous=True, 2025-05-07T20:32:17.5559745Z compiled=False, 2025-05-07T20:32:17.5559823Z ) 2025-05-07T20:32:17.5560061Z self = 2025-05-07T20:32:17.5560228Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5560232Z 2025-05-07T20:32:17.5560305Z @given( 2025-05-07T20:32:17.5560428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5560524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5560637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5560753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5560870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5560940Z ) 2025-05-07T20:32:17.5561181Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5561271Z def test_silu_mul_quant( 2025-05-07T20:32:17.5561348Z self, 2025-05-07T20:32:17.5561418Z T: int, 2025-05-07T20:32:17.5561489Z D: int, 2025-05-07T20:32:17.5561589Z scale_ub: Optional[float], 2025-05-07T20:32:17.5561675Z contiguous: bool, 2025-05-07T20:32:17.5561757Z compiled: bool, 2025-05-07T20:32:17.5561836Z ) -> None: 2025-05-07T20:32:17.5561927Z torch.manual_seed(2025) 2025-05-07T20:32:17.5561999Z 2025-05-07T20:32:17.5562167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5562238Z 2025-05-07T20:32:17.5562328Z > x_sign = torch.sign(x) 2025-05-07T20:32:17.5564148Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5564202Z 2025-05-07T20:32:17.5564320Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:17.5564325Z 2025-05-07T20:32:17.5564423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5564684Z self=, 2025-05-07T20:32:17.5564758Z T=16384, 2025-05-07T20:32:17.5564829Z D=5120, 2025-05-07T20:32:17.5564908Z scale_ub=None, 2025-05-07T20:32:17.5564994Z contiguous=True, 2025-05-07T20:32:17.5565074Z compiled=False, 2025-05-07T20:32:17.5565208Z ) 2025-05-07T20:32:17.5565420Z self = 2025-05-07T20:32:17.5565588Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5565593Z 2025-05-07T20:32:17.5565668Z @given( 2025-05-07T20:32:17.5565781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5565878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5565988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5566098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5566207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5566281Z ) 2025-05-07T20:32:17.5566519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5566609Z def test_silu_mul_quant( 2025-05-07T20:32:17.5566683Z self, 2025-05-07T20:32:17.5566755Z T: int, 2025-05-07T20:32:17.5566878Z D: int, 2025-05-07T20:32:17.5566972Z scale_ub: Optional[float], 2025-05-07T20:32:17.5567055Z contiguous: bool, 2025-05-07T20:32:17.5567139Z compiled: bool, 2025-05-07T20:32:17.5567213Z ) -> None: 2025-05-07T20:32:17.5567304Z torch.manual_seed(2025) 2025-05-07T20:32:17.5567375Z 2025-05-07T20:32:17.5567537Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5569323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:17.5569332Z 2025-05-07T20:32:17.5569446Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:17.5569450Z 2025-05-07T20:32:17.5569551Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5569776Z self=, 2025-05-07T20:32:17.5569846Z T=4096, 2025-05-07T20:32:17.5569941Z D=5120, 2025-05-07T20:32:17.5570024Z scale_ub=None, 2025-05-07T20:32:17.5570119Z contiguous=True, 2025-05-07T20:32:17.5570210Z compiled=False, 2025-05-07T20:32:17.5570277Z ) 2025-05-07T20:32:17.5570485Z self = 2025-05-07T20:32:17.5570655Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:17.5570659Z 2025-05-07T20:32:17.5570732Z @given( 2025-05-07T20:32:17.5570848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5570942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5571059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5571175Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5571284Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5571352Z ) 2025-05-07T20:32:17.5571596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5571734Z def test_silu_mul_quant( 2025-05-07T20:32:17.5571809Z self, 2025-05-07T20:32:17.5571879Z T: int, 2025-05-07T20:32:17.5571949Z D: int, 2025-05-07T20:32:17.5572047Z scale_ub: Optional[float], 2025-05-07T20:32:17.5572133Z contiguous: bool, 2025-05-07T20:32:17.5572252Z compiled: bool, 2025-05-07T20:32:17.5572334Z ) -> None: 2025-05-07T20:32:17.5572424Z torch.manual_seed(2025) 2025-05-07T20:32:17.5572489Z 2025-05-07T20:32:17.5572655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5574422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Hypothesis then retries eight more examples, and every one fails with the same torch.OutOfMemoryError at moe/activation_test.py:92 (the initial torch.randn), with GPU 0 reporting the same state each time: 22.07 GiB total capacity, 26.44 MiB free, 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated, and the same PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint. Only the sampled parameters and the requested allocation differ:

    T      D     scale_ub  contiguous  compiled  Tried to allocate
    2048   5120  None      False       False      40.00 MiB
    4096   7168  None      True        True      112.00 MiB
    2048   5120  1200.0    False       False      40.00 MiB
    4096   7168  1200.0    True        False     112.00 MiB
    16384  7168  None      False       True      448.00 MiB
    4096   7168  None      True        False     112.00 MiB
    16384  7168  None      True        False     448.00 MiB
    16384  7168  1200.0    True        False     448.00 MiB
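That even a 40.00 MiB request fails while 21.73 GiB stays allocated by PyTorch points at memory surviving from earlier examples (or earlier tests in the same process), not at any single example being too large for the 22 GiB card. A minimal sketch of one common mitigation, assuming a unittest-style test class like the one above (the tearDown hook is unittest's convention, not necessarily something this test defines):

    import gc

    import torch

    def free_cuda_memory() -> None:
        torch.cuda.synchronize()   # let in-flight kernels finish
        gc.collect()               # drop Python references to dead tensors
        torch.cuda.empty_cache()   # hand cached free blocks back to the driver

    # e.g. in the test class:
    # def tearDown(self) -> None:
    #     free_cuda_memory()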
The next example is small enough to allocate, and exposes a different failure:

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) fails with the same torch.OutOfMemoryError at moe/activation_test.py:92 (56.00 MiB requested; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) gets past its allocations and reaches the identical CompilationError through the torch.compile path; the traceback only adds
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
before entering fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) and failing in the Triton compiler with:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
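fp8e4nv is Triton's name for NVIDIA's FP8 E4M3 format (PyTorch's float8_e4m3fn). Triton only emits it for GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts it offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. The g5.4xlarge runner carries an A10G, which is compute capability 8.6, so every code path that materializes an fp8e4nv tensor fails to compile here regardless of shape or torch.compile. A minimal guard sketch, assuming a unittest-based suite like the one above (the decorator placement is illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (FP8 E4M3) needs SM >= 8.9; the A10G on this runner is SM 8.6.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # @unittest.skipUnless(fp8e4nv_supported(), "Triton fp8e4nv requires SM >= 8.9")
    # def test_silu_mul_quant(self, ...) -> None: ...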
Three final examples then run out of memory even at T=128; by this point GPU 0 has only 4.44 MiB free, with 21.77 GiB allocated by PyTorch and a few MiB reserved but unallocated:

    T    D     scale_ub  contiguous  compiled  Fails at                   Tried to allocate
    128  7168  1200.0    True        False     moe/activation_test.py:95   20.00 MiB
    128  5120  1200.0    True        True      moe/activation_test.py:95   20.00 MiB
    128  7168  None      True        True      moe/activation_test.py:92   20.00 MiB

Line 92 is the torch.randn call and line 95 the torch.clamp call; each raises the same torch.OutOfMemoryError with the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint.

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 1 failed, 1 deselected, 3 warnings in 24.11s =================
ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
[EXEC] [ATTEMPT 0/2] Command attempt failed.
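The retry wrapper now reruns the suite, but the failure no longer needs Hypothesis or the retry machinery to reproduce. A minimal standalone sketch distilled from the failing example above (T=128, D=5120, scale_ub=1200.0); the import path and call shape are taken from the log's traceback, and on this SM 8.6 runner it should raise the same triton.compiler.errors.CompilationError:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Expected on this runner: CompilationError("type fp8e4nv not supported ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)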
See " 2025-05-07T20:32:17.5664718Z 2025-05-07T20:32:17.5664926Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:17.5665088Z ================= 1 failed, 1 deselected, 3 warnings in 24.11s ================= 2025-05-07T20:32:19.1384508Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:19.1999846Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:19.2000196Z 2025-05-07T20:32:21.2016976Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:23.3700032Z ============================= test session starts ============================== 2025-05-07T20:32:23.3700734Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:23.3701291Z cachedir: .pytest_cache 2025-05-07T20:32:23.3701896Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:23.3702646Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:23.3703058Z plugins: hypothesis-6.131.14 2025-05-07T20:32:24.9668048Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.1806184Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:25.1806961Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:25.1807381Z 2025-05-07T20:32:27.8484208Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8485039Z self=, 2025-05-07T20:32:27.8485456Z T=1, 2025-05-07T20:32:27.8485645Z D=5120, 2025-05-07T20:32:27.8485846Z scale_ub=None, 2025-05-07T20:32:27.8486065Z contiguous=True, 2025-05-07T20:32:27.8486291Z compiled=True, 2025-05-07T20:32:27.8486790Z ) 2025-05-07T20:32:27.8487114Z self = 2025-05-07T20:32:27.8487596Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.8487866Z 2025-05-07T20:32:27.8487945Z @given( 2025-05-07T20:32:27.8488277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8488601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8488906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8489242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8489655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8489938Z ) 2025-05-07T20:32:27.8490297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8490743Z def test_silu_mul_quant( 2025-05-07T20:32:27.8490980Z self, 2025-05-07T20:32:27.8491181Z T: int, 2025-05-07T20:32:27.8491381Z D: int, 2025-05-07T20:32:27.8491601Z scale_ub: Optional[float], 2025-05-07T20:32:27.8491882Z contiguous: bool, 2025-05-07T20:32:27.8492126Z compiled: bool, 2025-05-07T20:32:27.8492350Z ) -> None: 2025-05-07T20:32:27.8492568Z torch.manual_seed(2025) 2025-05-07T20:32:27.8492817Z 2025-05-07T20:32:27.8493084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8493430Z 2025-05-07T20:32:27.8493625Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8493918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
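The rerun makes the scope of the problem clearer: not only FBGEMM's _fbgemm_silu_mul_quant but also the reference path's _kernel_quantize_fp8_row (launched by triton_quantize_fp8_row in fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py) targets fp8e4nv, so even the eager reference cannot run on this GPU. A hedged sketch of a Triton-free, pure-PyTorch row-wise FP8 reference follows; quantize_fp8_row_ref is a hypothetical helper, the scale semantics are inferred from the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), and the scale_ub handling is a guess rather than FBGEMM's documented behavior:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to the FP8 max value.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

PyTorch's float8_e4m3fn casts are emulated in software, so a reference like this runs on SM 8.6 even though the hardware has no FP8 units.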
2025-05-07T20:32:29.3640242Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3640628Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3640947Z x = x_sign * x_clamp 2025-05-07T20:32:29.3641186Z x0 = x[:, :D] 2025-05-07T20:32:29.3641406Z x1 = x[:, D:] 2025-05-07T20:32:29.3641619Z 2025-05-07T20:32:29.3641802Z if contiguous: 2025-05-07T20:32:29.3642041Z x0 = x0.contiguous() 2025-05-07T20:32:29.3642315Z x1 = x1.contiguous() 2025-05-07T20:32:29.3642558Z 2025-05-07T20:32:29.3642755Z if scale_ub is not None: 2025-05-07T20:32:29.3643069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3643429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3643744Z ) 2025-05-07T20:32:29.3643934Z else: 2025-05-07T20:32:29.3644146Z scale_ub_tensor = None 2025-05-07T20:32:29.3644399Z 2025-05-07T20:32:29.3644627Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3644949Z op = silu_mul_quant 2025-05-07T20:32:29.3645210Z if compiled: 2025-05-07T20:32:29.3645466Z op = torch.compile(op) 2025-05-07T20:32:29.3645761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3646041Z 2025-05-07T20:32:29.3646238Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3646403Z 2025-05-07T20:32:29.3646503Z moe/activation_test.py:117: 2025-05-07T20:32:29.3646810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3647146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3647421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3648116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3648813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3649358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3650042Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3650741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3651269Z kernel = self.compile( 2025-05-07T20:32:29.3651812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3652550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3652942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3653178Z 2025-05-07T20:32:29.3653461Z self = 2025-05-07T20:32:29.3654551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3655991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fb83b05e0>} 2025-05-07T20:32:29.3657334Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3658354Z context = 2025-05-07T20:32:29.3658649Z 2025-05-07T20:32:29.3658814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3659340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3659809Z module_map=module_map) 2025-05-07T20:32:29.3660173Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3660576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3660837Z E ^ 2025-05-07T20:32:29.3661294Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3661751Z 2025-05-07T20:32:29.3662163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3662715Z 2025-05-07T20:32:29.3662838Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3663252Z self=, 2025-05-07T20:32:29.3663649Z T=2048, 2025-05-07T20:32:29.3663838Z D=5120, 2025-05-07T20:32:29.3664035Z scale_ub=1200.0, 2025-05-07T20:32:29.3664256Z contiguous=True, 2025-05-07T20:32:29.3664484Z compiled=True, 2025-05-07T20:32:29.3664702Z ) 2025-05-07T20:32:29.3665019Z self = 2025-05-07T20:32:29.3665521Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.3665793Z 2025-05-07T20:32:29.3665879Z @given( 2025-05-07T20:32:29.3666122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3666429Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3666737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3667072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3667402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3667692Z ) 2025-05-07T20:32:29.3668045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3668491Z def test_silu_mul_quant( 2025-05-07T20:32:29.3668737Z self, 2025-05-07T20:32:29.3668935Z T: int, 2025-05-07T20:32:29.3669133Z D: int, 2025-05-07T20:32:29.3669356Z scale_ub: Optional[float], 2025-05-07T20:32:29.3669635Z contiguous: bool, 2025-05-07T20:32:29.3669982Z compiled: bool, 2025-05-07T20:32:29.3670214Z ) -> None: 2025-05-07T20:32:29.3670430Z torch.manual_seed(2025) 2025-05-07T20:32:29.3670677Z 2025-05-07T20:32:29.3670942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3671283Z 2025-05-07T20:32:29.3671477Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3671760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3672122Z x = x_sign * x_clamp 2025-05-07T20:32:29.3672364Z x0 = x[:, :D] 2025-05-07T20:32:29.3672574Z x1 = x[:, D:] 2025-05-07T20:32:29.3672789Z 2025-05-07T20:32:29.3673006Z if contiguous: 2025-05-07T20:32:29.3673294Z x0 = x0.contiguous() 2025-05-07T20:32:29.3673562Z x1 = x1.contiguous() 2025-05-07T20:32:29.3673806Z 2025-05-07T20:32:29.3673991Z if scale_ub is not None: 2025-05-07T20:32:29.3674271Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3674622Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3674972Z ) 2025-05-07T20:32:29.3675171Z else: 2025-05-07T20:32:29.3675386Z scale_ub_tensor = None 2025-05-07T20:32:29.3675632Z 2025-05-07T20:32:29.3675872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3676191Z op = silu_mul_quant 2025-05-07T20:32:29.3676451Z if compiled: 
2025-05-07T20:32:29.3676697Z op = torch.compile(op) 2025-05-07T20:32:29.3676996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3677276Z 2025-05-07T20:32:29.3677465Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.3677765Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.3678061Z 2025-05-07T20:32:29.3678298Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3678635Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.3678931Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.3679288Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.3679654Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.3679968Z 2025-05-07T20:32:29.3680172Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.3680371Z 2025-05-07T20:32:29.3680471Z moe/activation_test.py:126: 2025-05-07T20:32:29.3680780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3681128Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.3681460Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.3682270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.3683046Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.3683607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3684295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3684980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.3685706Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.3686450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:29.3687203Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.3687938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.3688589Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.3689191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.3689719Z fn() 2025-05-07T20:32:29.3690224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.3690801Z self.fn.run( 2025-05-07T20:32:29.3691261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3691792Z kernel = self.compile( 2025-05-07T20:32:29.3692507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3693311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3693858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3694153Z 2025-05-07T20:32:29.3694414Z self = 2025-05-07T20:32:29.3695581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True)
2025-05-07T20:32:29.3696999Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fcac53550>}
2025-05-07T20:32:29.3698358Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:29.3699389Z context =
2025-05-07T20:32:29.3699680Z
2025-05-07T20:32:29.3699851Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:29.3700376Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:29.3700838Z                           module_map=module_map)
2025-05-07T20:32:29.3701250Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:29.3701610Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:29.3701877Z E       ^
2025-05-07T20:32:29.3702406Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:29.3702969Z
2025-05-07T20:32:29.3703500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
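Every trial in this run dies at the same point: Triton refuses to emit IR for the fp8e4nv type, which is the e4m3 format that recent PyTorch exposes as torch.float8_e4m3fn. The usual cutoff is NVIDIA compute capability 8.9 (Ada/Hopper); on older parts the backend offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError says. A minimal sketch of a guard built on that assumption (the 8.9 cutoff and the helper name are mine, not FBGEMM's):

    import torch

    def fp8_dtype_for_device() -> torch.dtype:
        # Assumption: fp8e4nv (torch.float8_e4m3fn) only compiles on
        # compute capability >= 8.9; fp8e5 (torch.float8_e5m2) is the
        # fallback this architecture does support per the error above.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2

Gating the test, or the kernel's output dtype, on a check like this would turn these repeated compile failures into a clean skip or a supported fallback.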
(The remaining Hypothesis trials repeat the identical test body and traceback and fail in the same make_ir call with the same ValueError; only the parameters tried and the kernel that failed to compile are listed below.)
2025-05-07T20:32:29.3704585Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:30.7083293Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
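Both failing kernels implement the recipe the test body spells out: fn() reaches the fused _fbgemm_silu_mul_quant, while ref_fn() reaches _kernel_quantize_fp8_row through triton_quantize_fp8_row after computing SiLU(x0) * x1 in fp32. A rough sketch of the row-wise quantization step, assuming an e4m3 max magnitude of 448 and treating scale_ub as a cap on the per-row maximum (both are assumptions about the kernel's semantics, and rowwise_fp8_quant_sketch is a hypothetical name, not FBGEMM's code):

    import torch

    FP8_MAX = 448.0  # assumed e4m3fn max representable magnitude

    def rowwise_fp8_quant_sketch(y: torch.Tensor, scale_ub: torch.Tensor = None):
        # One symmetric scale per row so each row fills the fp8 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_q = torch.clamp(y / scale[:, None], -FP8_MAX, FP8_MAX)
        return y_q.to(torch.float8_e4m3fn), scale

This matches the dequantization the test performs afterwards, y_fp8.to(torch.float32) * y_scale[:, None].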
2025-05-07T20:32:30.7123892Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
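The contiguous parameter matters because x0 = x[:, :D] and x1 = x[:, D:] are strided views into one buffer, and .contiguous() materializes dense copies, so the kernel sees two different memory layouts. A quick CPU-only illustration with toy shapes:

    import torch

    x = torch.randn(4, 8)
    x0 = x[:, :4]                 # view: size (4, 4), stride (8, 1)
    print(x0.is_contiguous())     # False: rows are 8 elements apart
    x0c = x0.contiguous()         # dense copy with stride (4, 1)
    print(x0c.is_contiguous())    # True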
2025-05-07T20:32:32.4785267Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.4816359Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
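One pattern stands out across the trials: every compiled=False example fails inside fn(), compiling _fbgemm_silu_mul_quant, while every compiled=True example gets past fn() and only fails later in the eager ref_fn(). Evidently the torch.compile path avoids or defers the custom Triton kernel on this GPU, while the eager reference quantizer still hits it. The toggle itself is the usual wrapper pattern; a minimal sketch (maybe_compile is a hypothetical helper name):

    import torch

    def maybe_compile(op, compiled: bool):
        # torch.compile defers all work to the first call, so any
        # kernel compilation error would surface inside the wrapped
        # call rather than here.
        return torch.compile(op) if compiled else op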
2025-05-07T20:32:32.5627301Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.9642105Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn() at moe/activation_test.py:117)
2025-05-07T20:32:32.9673049Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn() at moe/activation_test.py:126)
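All of these parameter combinations come from the fixed grid in the @given decorator, and @settings(max_examples=_MAX_SAMPLES) bounds how many of them Hypothesis actually tries, which is why only a subset appears in this log. A quick check of the grid size, with the values copied from the decorator:

    from itertools import product

    grid = list(product(
        [1, 128, 2048, 4096, 16384],  # T
        [5120, 7168],                 # D
        [None, 1200.0],               # scale_ub
        [True, False],                # contiguous
        [True, False],                # compiled
    ))
    print(len(grid))  # 5 * 2 * 2 * 2 * 2 = 80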
y_scale_ref = ref_fn() 2025-05-07T20:32:33.6258291Z 2025-05-07T20:32:33.6258394Z moe/activation_test.py:126: 2025-05-07T20:32:33.6258690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6259029Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:33.6259352Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:33.6260157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:33.6260921Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:33.6261462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.6262151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.6262837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:33.6263557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.6264302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:33.6265055Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:33.6265788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:33.6266425Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:33.6267019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:33.6267539Z fn() 2025-05-07T20:32:33.6268056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:33.6268631Z self.fn.run( 2025-05-07T20:32:33.6269098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.6269631Z kernel = self.compile( 2025-05-07T20:32:33.6270289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.6270988Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.6271432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.6271661Z 2025-05-07T20:32:33.6271879Z self = 2025-05-07T20:32:33.6272969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.6274476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca3124c0>} 2025-05-07T20:32:33.6275827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.6276858Z context = 2025-05-07T20:32:33.6277145Z 2025-05-07T20:32:33.6277323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.6277841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.6278313Z module_map=module_map) 2025-05-07T20:32:33.6278728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.6279089Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:33.6279360Z E ^ 2025-05-07T20:32:33.6279831Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.6280283Z 2025-05-07T20:32:33.6280709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.6281229Z 2025-05-07T20:32:33.6281335Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.6281755Z self=, 2025-05-07T20:32:33.6282162Z T=2048, 2025-05-07T20:32:33.6282353Z D=5120, 2025-05-07T20:32:33.6282541Z scale_ub=None, 2025-05-07T20:32:33.6282760Z contiguous=True, 2025-05-07T20:32:33.6282986Z compiled=True, 2025-05-07T20:32:33.6283188Z ) 2025-05-07T20:32:34.2371772Z self = 2025-05-07T20:32:34.2372685Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.2373131Z 2025-05-07T20:32:34.2373254Z @given( 2025-05-07T20:32:34.2373627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2374160Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2374653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2375205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2375737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2376206Z ) 2025-05-07T20:32:34.2376739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2377368Z def test_silu_mul_quant( 2025-05-07T20:32:34.2377698Z self, 2025-05-07T20:32:34.2377946Z T: int, 2025-05-07T20:32:34.2378203Z D: int, 2025-05-07T20:32:34.2378493Z scale_ub: Optional[float], 2025-05-07T20:32:34.2378848Z contiguous: bool, 2025-05-07T20:32:34.2379172Z compiled: bool, 2025-05-07T20:32:34.2379474Z ) -> None: 2025-05-07T20:32:34.2379750Z torch.manual_seed(2025) 2025-05-07T20:32:34.2380075Z 2025-05-07T20:32:34.2380437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2380904Z 2025-05-07T20:32:34.2381158Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2381859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2382276Z x = x_sign * x_clamp 2025-05-07T20:32:34.2382597Z x0 = x[:, :D] 2025-05-07T20:32:34.2382880Z x1 = x[:, D:] 2025-05-07T20:32:34.2383148Z 2025-05-07T20:32:34.2383573Z if contiguous: 2025-05-07T20:32:34.2383929Z x0 = x0.contiguous() 2025-05-07T20:32:34.2384294Z x1 = x1.contiguous() 2025-05-07T20:32:34.2384648Z 2025-05-07T20:32:34.2384920Z if scale_ub is not None: 2025-05-07T20:32:34.2385320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2385980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2386458Z ) 2025-05-07T20:32:34.2386738Z else: 2025-05-07T20:32:34.2387044Z scale_ub_tensor = None 2025-05-07T20:32:34.2387401Z 2025-05-07T20:32:34.2387715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2388152Z op = silu_mul_quant 2025-05-07T20:32:34.2388500Z if compiled: 
2025-05-07T20:32:34.2388846Z op = torch.compile(op) 2025-05-07T20:32:34.2389249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2389631Z 2025-05-07T20:32:34.2390027Z y_fp8, y_scale = fn() 2025-05-07T20:32:34.2390439Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:34.2390880Z 2025-05-07T20:32:34.2391246Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2391747Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:34.2392336Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:34.2392862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:34.2393439Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.2393921Z 2025-05-07T20:32:34.2394226Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:34.2394518Z 2025-05-07T20:32:34.2394667Z moe/activation_test.py:126: 2025-05-07T20:32:34.2395115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2395658Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:34.2396180Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:34.2397489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:34.2398780Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:34.2399667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2400858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2402008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:34.2403270Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.2404978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:34.2415328Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:34.2416653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:34.2417789Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:34.2418824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:34.2419738Z fn() 2025-05-07T20:32:34.2420596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:34.2421617Z self.fn.run( 2025-05-07T20:32:34.2422378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2423441Z kernel = self.compile( 2025-05-07T20:32:34.2424364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2425502Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2426178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2426567Z 2025-05-07T20:32:34.2426906Z self = 2025-05-07T20:32:34.2428739Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:34.2431492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca00cf70>} 2025-05-07T20:32:34.2433883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2435683Z context = 2025-05-07T20:32:34.2436193Z 2025-05-07T20:32:34.2436469Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2437369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2438270Z module_map=module_map) 2025-05-07T20:32:34.2438883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2439478Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:34.2439919Z E ^ 2025-05-07T20:32:34.2440703Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2441487Z 2025-05-07T20:32:34.2442206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2443104Z 2025-05-07T20:32:34.2443287Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2443989Z self=, 2025-05-07T20:32:34.2444661Z T=128, 2025-05-07T20:32:34.2444962Z D=5120, 2025-05-07T20:32:34.2445267Z scale_ub=None, 2025-05-07T20:32:34.2445600Z contiguous=True, 2025-05-07T20:32:34.2445958Z compiled=True, 2025-05-07T20:32:34.2446292Z ) 2025-05-07T20:32:35.2143277Z self = 2025-05-07T20:32:35.2144158Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2144599Z 2025-05-07T20:32:35.2144722Z @given( 2025-05-07T20:32:35.2145099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2145609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2146110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2146653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2147176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2147649Z ) 2025-05-07T20:32:35.2148186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2148802Z def test_silu_mul_quant( 2025-05-07T20:32:35.2149128Z self, 2025-05-07T20:32:35.2149389Z T: int, 2025-05-07T20:32:35.2149650Z D: int, 2025-05-07T20:32:35.2150044Z scale_ub: Optional[float], 2025-05-07T20:32:35.2150408Z contiguous: bool, 2025-05-07T20:32:35.2150730Z compiled: bool, 2025-05-07T20:32:35.2151021Z ) -> None: 2025-05-07T20:32:35.2151306Z torch.manual_seed(2025) 2025-05-07T20:32:35.2151633Z 2025-05-07T20:32:35.2151990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2152778Z 2025-05-07T20:32:35.2153038Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2153426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2153846Z x = x_sign * x_clamp 2025-05-07T20:32:35.2154308Z x0 = x[:, :D] 2025-05-07T20:32:35.2154597Z x1 = x[:, D:] 2025-05-07T20:32:35.2154872Z 2025-05-07T20:32:35.2155109Z if contiguous: 2025-05-07T20:32:35.2155420Z x0 = x0.contiguous() 2025-05-07T20:32:35.2155768Z x1 = x1.contiguous() 2025-05-07T20:32:35.2156088Z 2025-05-07T20:32:35.2156454Z if scale_ub is not None: 2025-05-07T20:32:35.2156824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2157275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2157698Z ) 2025-05-07T20:32:35.2157956Z else: 2025-05-07T20:32:35.2158238Z scale_ub_tensor = None 2025-05-07T20:32:35.2158574Z 2025-05-07T20:32:35.2158894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:34.2443287Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test listing, traceback, and CompilationError as above: fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row]
2025-05-07T20:32:35.2203582Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same failure: ref_fn() -> _kernel_quantize_fp8_row]
2025-05-07T20:32:36.0581564Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:36.1022224Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:36.1023792Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:36.1025204Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:36.1026206Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:36.1027328Z W0507 20:32:36.100620 87906 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
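The recompile-limit warning above is separate from the FP8 failure: the Hypothesis examples alternate contiguous and non-contiguous slices, so 'x0' changes strides between calls and torch.compile re-specializes silu_mul_quant until it hits config.recompile_limit (8) and falls back to eager. A short sketch of the knobs the warning itself names, plus torch.compile's standard dynamic=True option as an assumed way to avoid the stride re-specialization:

    import torch
    import torch._dynamo

    # 1. Surface every recompilation reason (env var named in the warning):
    #      TORCH_LOGS="recompiles" python -m pytest moe/activation_test.py
    # 2. Raise the per-function budget (defers the eager fallback, nothing more):
    torch._dynamo.config.recompile_limit = 16
    # 3. Compile with dynamic shapes so stride/size changes do not re-specialize:
    silu_mul = torch.compile(
        lambda a, b: a * torch.sigmoid(a) * b,  # illustrative stand-in op
        dynamic=True,
    )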
2025-05-07T20:32:36.2242890Z self = 2025-05-07T20:32:36.2243645Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.2244200Z 2025-05-07T20:32:36.2244279Z @given( 2025-05-07T20:32:36.2244511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2244825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2245211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2245541Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2245868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2246142Z ) 2025-05-07T20:32:36.2246496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2247009Z def test_silu_mul_quant( 2025-05-07T20:32:36.2247245Z self, 2025-05-07T20:32:36.2247428Z T: int, 2025-05-07T20:32:36.2247620Z D: int, 2025-05-07T20:32:36.2247835Z scale_ub: Optional[float], 2025-05-07T20:32:36.2248096Z contiguous: bool, 2025-05-07T20:32:36.2248329Z compiled: bool, 2025-05-07T20:32:36.2248557Z ) -> None: 2025-05-07T20:32:36.2248765Z torch.manual_seed(2025) 2025-05-07T20:32:36.2249004Z 2025-05-07T20:32:36.2249270Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2249604Z 2025-05-07T20:32:36.2249797Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2250083Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2250383Z x = x_sign * x_clamp 2025-05-07T20:32:36.2250620Z x0 = x[:, :D] 2025-05-07T20:32:36.2250832Z x1 = x[:, D:] 2025-05-07T20:32:36.2251032Z 2025-05-07T20:32:36.2251292Z if contiguous: 2025-05-07T20:32:36.2251523Z x0 = x0.contiguous() 2025-05-07T20:32:36.2251772Z x1 = x1.contiguous() 2025-05-07T20:32:36.2252005Z 2025-05-07T20:32:36.2252194Z if scale_ub is not None: 2025-05-07T20:32:36.2252456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2252791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2253101Z ) 2025-05-07T20:32:36.2253289Z else: 2025-05-07T20:32:36.2253488Z scale_ub_tensor = None 2025-05-07T20:32:36.2253738Z 2025-05-07T20:32:36.2253972Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2254308Z op = silu_mul_quant 2025-05-07T20:32:36.2254587Z if compiled: 2025-05-07T20:32:36.2254835Z op = torch.compile(op) 2025-05-07T20:32:36.2255123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2255398Z 2025-05-07T20:32:36.2255592Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.2255872Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.2256167Z 2025-05-07T20:32:36.2256410Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2256740Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.2257032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.2257342Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.2257699Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2258000Z 2025-05-07T20:32:36.2258200Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:36.2258391Z 2025-05-07T20:32:36.2258498Z moe/activation_test.py:126: 2025-05-07T20:32:36.2258787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2259119Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.2259440Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.2260220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.2260974Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.2261513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2262187Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2262919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.2263673Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2264424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.2265213Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.2266000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.2266633Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.2267228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.2267740Z fn() 2025-05-07T20:32:36.2268235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.2268809Z self.fn.run( 2025-05-07T20:32:36.2269270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2269784Z kernel = self.compile( 2025-05-07T20:32:36.2270405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2271048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2271509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2271737Z 2025-05-07T20:32:36.2271942Z self = 2025-05-07T20:32:36.2273023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2274417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99b0c10>} 2025-05-07T20:32:36.2275759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2276775Z context = 2025-05-07T20:32:36.2277068Z 2025-05-07T20:32:36.2277231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2277752Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2278217Z module_map=module_map) 2025-05-07T20:32:36.2278577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2278938Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.2279204Z E ^ 2025-05-07T20:32:36.2279663Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2280121Z 2025-05-07T20:32:36.2280530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2281047Z 2025-05-07T20:32:36.2281146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2281568Z self=, 2025-05-07T20:32:36.2281970Z T=1, 2025-05-07T20:32:36.2282153Z D=5120, 2025-05-07T20:32:36.2282348Z scale_ub=1200.0, 2025-05-07T20:32:36.2282561Z contiguous=True, 2025-05-07T20:32:36.2282785Z compiled=True, 2025-05-07T20:32:36.2282992Z ) 2025-05-07T20:32:36.3987982Z self = 2025-05-07T20:32:36.3989029Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.3989352Z 2025-05-07T20:32:36.3989436Z @given( 2025-05-07T20:32:36.3989678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.3990185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.3990493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.3990825Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.3991159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.3991520Z ) 2025-05-07T20:32:36.3991880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.3992321Z def test_silu_mul_quant( 2025-05-07T20:32:36.3992562Z self, 2025-05-07T20:32:36.3992813Z T: int, 2025-05-07T20:32:36.3993058Z D: int, 2025-05-07T20:32:36.3993277Z scale_ub: Optional[float], 2025-05-07T20:32:36.3993547Z contiguous: bool, 2025-05-07T20:32:36.3993786Z compiled: bool, 2025-05-07T20:32:36.3994011Z ) -> None: 2025-05-07T20:32:36.3994225Z torch.manual_seed(2025) 2025-05-07T20:32:36.3994505Z 2025-05-07T20:32:36.3994793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.3995130Z 2025-05-07T20:32:36.3995321Z x_sign = torch.sign(x) 2025-05-07T20:32:36.3995611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.3995913Z x = x_sign * x_clamp 2025-05-07T20:32:36.3996152Z x0 = x[:, :D] 2025-05-07T20:32:36.3996454Z x1 = x[:, D:] 2025-05-07T20:32:36.3996660Z 2025-05-07T20:32:36.3996848Z if contiguous: 2025-05-07T20:32:36.3997081Z x0 = x0.contiguous() 2025-05-07T20:32:36.3997346Z x1 = x1.contiguous() 2025-05-07T20:32:36.4004983Z 2025-05-07T20:32:36.4005222Z if scale_ub is not None: 2025-05-07T20:32:36.4005517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4005875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4006183Z ) 2025-05-07T20:32:36.4006385Z else: 2025-05-07T20:32:36.4006608Z scale_ub_tensor = None 2025-05-07T20:32:36.4006863Z 2025-05-07T20:32:36.4007114Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4007441Z op = silu_mul_quant 2025-05-07T20:32:36.4007693Z if compiled: 2025-05-07T20:32:36.4007953Z op = torch.compile(op) 2025-05-07T20:32:36.4008258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4008546Z 2025-05-07T20:32:36.4008733Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4008910Z 2025-05-07T20:32:36.4009010Z moe/activation_test.py:117: 2025-05-07T20:32:36.4009315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4009647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4009942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4010514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.4011073Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.4011741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4012435Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4012979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4013660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4014326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4014859Z kernel = self.compile( 2025-05-07T20:32:36.4015410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4016187Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4016594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4016824Z 2025-05-07T20:32:36.4017109Z self = 2025-05-07T20:32:36.4018211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4019679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9256670>} 2025-05-07T20:32:36.4021024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4022056Z context = 2025-05-07T20:32:36.4022342Z 2025-05-07T20:32:36.4022517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4023038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4023508Z module_map=module_map) 2025-05-07T20:32:36.4023884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4024340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4024625Z E ^ 2025-05-07T20:32:36.4025092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4025546Z 2025-05-07T20:32:36.4025972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4026488Z 2025-05-07T20:32:36.4026595Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4027010Z self=, 2025-05-07T20:32:36.4027419Z T=1, 2025-05-07T20:32:36.4027612Z D=5120, 2025-05-07T20:32:36.4027810Z scale_ub=None, 2025-05-07T20:32:36.4028033Z contiguous=False, 2025-05-07T20:32:36.4028266Z compiled=True, 2025-05-07T20:32:36.4028473Z ) 2025-05-07T20:32:36.4832146Z self = 2025-05-07T20:32:36.4832867Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.4833178Z 2025-05-07T20:32:36.4833271Z @given( 2025-05-07T20:32:36.4833502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4833825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4834144Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4834525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4834869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4835160Z ) 2025-05-07T20:32:36.4835509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4835960Z def test_silu_mul_quant( 2025-05-07T20:32:36.4836205Z self, 2025-05-07T20:32:36.4836406Z T: int, 2025-05-07T20:32:36.4836609Z D: int, 2025-05-07T20:32:36.4836833Z scale_ub: Optional[float], 2025-05-07T20:32:36.4837111Z contiguous: bool, 2025-05-07T20:32:36.4837349Z compiled: bool, 2025-05-07T20:32:36.4837584Z ) -> None: 2025-05-07T20:32:36.4837804Z torch.manual_seed(2025) 2025-05-07T20:32:36.4838043Z 2025-05-07T20:32:36.4838318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4838662Z 2025-05-07T20:32:36.4838852Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4839146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4839652Z x = x_sign * x_clamp 2025-05-07T20:32:36.4839888Z x0 = x[:, :D] 2025-05-07T20:32:36.4840109Z x1 = x[:, D:] 2025-05-07T20:32:36.4840319Z 2025-05-07T20:32:36.4840501Z if contiguous: 2025-05-07T20:32:36.4840813Z x0 = x0.contiguous() 2025-05-07T20:32:36.4841077Z x1 = x1.contiguous() 2025-05-07T20:32:36.4841314Z 2025-05-07T20:32:36.4841516Z if scale_ub is not None: 2025-05-07T20:32:36.4841792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4842137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4842518Z ) 2025-05-07T20:32:36.4842714Z else: 2025-05-07T20:32:36.4842926Z scale_ub_tensor = None 2025-05-07T20:32:36.4843175Z 2025-05-07T20:32:36.4843411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4843730Z op = silu_mul_quant 2025-05-07T20:32:36.4843978Z if compiled: 2025-05-07T20:32:36.4844237Z op = torch.compile(op) 2025-05-07T20:32:36.4844568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4844869Z 2025-05-07T20:32:36.4845066Z y_fp8, y_scale = fn() 2025-05-07T20:32:36.4845362Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:36.4845649Z 2025-05-07T20:32:36.4845890Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4846233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:36.4846532Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:36.4846917Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:36.4847285Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.4847600Z 2025-05-07T20:32:36.4847800Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:36.4848002Z 2025-05-07T20:32:36.4848105Z moe/activation_test.py:126: 2025-05-07T20:32:36.4848407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4848742Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:36.4849073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:36.4849875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:36.4850646Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:36.4851193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4851893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4852585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:36.4853312Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.4854061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:36.4854828Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:36.4855566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:36.4856214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:36.4856824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:36.4857346Z fn() 2025-05-07T20:32:36.4857862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:36.4858444Z self.fn.run( 2025-05-07T20:32:36.4858915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4859449Z kernel = self.compile( 2025-05-07T20:32:36.4860066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4860713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4861153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4861382Z 2025-05-07T20:32:36.4861595Z self = 2025-05-07T20:32:36.4862688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4864115Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fc92c0dc0>} 2025-05-07T20:32:36.4865518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4866552Z context = 2025-05-07T20:32:36.4866847Z 2025-05-07T20:32:36.4867023Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4867544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4868016Z module_map=module_map) 2025-05-07T20:32:36.4868426Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4868789Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:36.4869050Z E ^ 2025-05-07T20:32:36.4869516Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4870078Z 2025-05-07T20:32:36.4870501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4871018Z 2025-05-07T20:32:36.4871130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4871545Z self=, 2025-05-07T20:32:36.4871950Z T=1, 2025-05-07T20:32:36.4872135Z D=5120, 2025-05-07T20:32:36.4872329Z scale_ub=None, 2025-05-07T20:32:36.4872547Z contiguous=True, 2025-05-07T20:32:36.4872773Z compiled=False, 2025-05-07T20:32:36.4872981Z ) 2025-05-07T20:32:36.8406740Z self = 2025-05-07T20:32:36.8407385Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.8407755Z 2025-05-07T20:32:36.8407875Z @given( 2025-05-07T20:32:36.8408112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8408513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8408828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8409151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8409481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8409768Z ) 2025-05-07T20:32:36.8410128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8410561Z def test_silu_mul_quant( 2025-05-07T20:32:36.8410801Z self, 2025-05-07T20:32:36.8410994Z T: int, 2025-05-07T20:32:36.8411184Z D: int, 2025-05-07T20:32:36.8411403Z scale_ub: Optional[float], 2025-05-07T20:32:36.8411675Z contiguous: bool, 2025-05-07T20:32:36.8411904Z compiled: bool, 2025-05-07T20:32:36.8412132Z ) -> None: 2025-05-07T20:32:36.8412345Z torch.manual_seed(2025) 2025-05-07T20:32:36.8412578Z 2025-05-07T20:32:36.8412848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8413188Z 2025-05-07T20:32:36.8413650Z x_sign = torch.sign(x) 2025-05-07T20:32:36.8413936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.8414258Z x = x_sign * x_clamp 2025-05-07T20:32:36.8414535Z x0 = x[:, :D] 2025-05-07T20:32:36.8414743Z x1 = x[:, D:] 2025-05-07T20:32:36.8415027Z 2025-05-07T20:32:36.8415212Z if contiguous: 2025-05-07T20:32:36.8415453Z x0 = x0.contiguous() 2025-05-07T20:32:36.8415710Z x1 = x1.contiguous() 2025-05-07T20:32:36.8415948Z 2025-05-07T20:32:36.8416140Z if scale_ub is not None: 2025-05-07T20:32:36.8416413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.8416829Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.8417136Z ) 2025-05-07T20:32:36.8417323Z else: 2025-05-07T20:32:36.8417530Z scale_ub_tensor = None 2025-05-07T20:32:36.8417776Z 2025-05-07T20:32:36.8417997Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.8418314Z op = silu_mul_quant 2025-05-07T20:32:36.8418561Z if compiled: 2025-05-07T20:32:36.8418803Z op 
= torch.compile(op) 2025-05-07T20:32:36.8419099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8419376Z 2025-05-07T20:32:36.8419575Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.8419739Z 2025-05-07T20:32:36.8419842Z moe/activation_test.py:117: 2025-05-07T20:32:36.8420142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8420474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.8420828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8421526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.8422215Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.8422752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.8423429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.8424085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.8424644Z kernel = self.compile( 2025-05-07T20:32:36.8425197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.8425845Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8426238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8426465Z 2025-05-07T20:32:36.8426697Z self = 2025-05-07T20:32:36.8427781Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.8429166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc926edc0>} 2025-05-07T20:32:36.8430612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.8431633Z context = 2025-05-07T20:32:36.8431919Z 2025-05-07T20:32:36.8432090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.8432608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8433066Z module_map=module_map) 2025-05-07T20:32:36.8433431Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8433841Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8434090Z E ^ 2025-05-07T20:32:36.8434555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8435004Z 2025-05-07T20:32:36.8435464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.8435976Z 2025-05-07T20:32:36.8436089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8436496Z self=, 2025-05-07T20:32:36.8436941Z T=128, 2025-05-07T20:32:36.8437129Z D=5120, 2025-05-07T20:32:36.8437312Z scale_ub=None, 2025-05-07T20:32:36.8437530Z contiguous=False, 2025-05-07T20:32:36.8437757Z compiled=True, 2025-05-07T20:32:36.8437960Z ) 2025-05-07T20:32:36.8438284Z self = 2025-05-07T20:32:36.8438780Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.8439048Z 2025-05-07T20:32:36.8439131Z @given( 2025-05-07T20:32:36.8439357Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.8439672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.8439985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.8440311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.8440646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.8440931Z ) 2025-05-07T20:32:36.8441324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.8441767Z def test_silu_mul_quant( 2025-05-07T20:32:36.8442014Z self, 2025-05-07T20:32:36.8442204Z T: int, 2025-05-07T20:32:36.8442400Z D: int, 2025-05-07T20:32:36.8442617Z scale_ub: Optional[float], 2025-05-07T20:32:36.8442881Z contiguous: bool, 2025-05-07T20:32:36.8443125Z compiled: bool, 2025-05-07T20:32:36.8443346Z ) -> None: 2025-05-07T20:32:36.8443562Z torch.manual_seed(2025) 2025-05-07T20:32:36.8443798Z 2025-05-07T20:32:36.8444067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.8444445Z 2025-05-07T20:32:36.8444649Z x_sign = torch.sign(x) 2025-05-07T20:32:36.8444940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.8445249Z x = x_sign * x_clamp 2025-05-07T20:32:36.8445478Z x0 = x[:, :D] 2025-05-07T20:32:36.8445695Z x1 = x[:, D:] 2025-05-07T20:32:36.8445905Z 2025-05-07T20:32:36.8446081Z if contiguous: 2025-05-07T20:32:36.8446310Z x0 = x0.contiguous() 2025-05-07T20:32:36.8446570Z x1 = x1.contiguous() 2025-05-07T20:32:36.8446802Z 2025-05-07T20:32:36.8446991Z if scale_ub is not None: 2025-05-07T20:32:36.8447262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.8447593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.8447897Z ) 2025-05-07T20:32:36.8448092Z else: 2025-05-07T20:32:36.8448299Z scale_ub_tensor = None 2025-05-07T20:32:36.8448542Z 2025-05-07T20:32:36.8448774Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.8449091Z op = silu_mul_quant 2025-05-07T20:32:36.8449338Z if compiled: 2025-05-07T20:32:36.8449582Z op = torch.compile(op) 2025-05-07T20:32:36.8449885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8450156Z 2025-05-07T20:32:36.8450352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.8450515Z 2025-05-07T20:32:36.8450620Z moe/activation_test.py:117: 2025-05-07T20:32:36.8450910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8451243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.8451523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.8452128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.8452679Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.8453387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.8454074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.8454629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.8455334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.8456097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.8456624Z kernel = self.compile( 2025-05-07T20:32:36.8457158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.8457809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8458202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.8458427Z 2025-05-07T20:32:36.8458638Z self = 2025-05-07T20:32:36.8459723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.8461174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8040>} 2025-05-07T20:32:36.8462518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.8463548Z context = 2025-05-07T20:32:36.8463834Z 2025-05-07T20:32:36.8463999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.8464521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8464982Z module_map=module_map) 2025-05-07T20:32:36.8465344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8465691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8465958Z E ^ 2025-05-07T20:32:36.8466423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8466871Z 2025-05-07T20:32:36.8467285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.8467803Z 2025-05-07T20:32:36.8467905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.8468318Z self=, 2025-05-07T20:32:36.8468718Z T=128, 2025-05-07T20:32:36.8468898Z D=7168, 2025-05-07T20:32:36.8469090Z scale_ub=1200.0, 2025-05-07T20:32:36.8469311Z contiguous=False, 2025-05-07T20:32:36.8469532Z compiled=False, 2025-05-07T20:32:36.8469736Z ) 2025-05-07T20:32:37.0011728Z self = 2025-05-07T20:32:37.0012247Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0012556Z 2025-05-07T20:32:37.0012640Z @given( 2025-05-07T20:32:37.0012885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0013316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0013677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0014005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0014573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0014903Z ) 2025-05-07T20:32:37.0015250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0015692Z def test_silu_mul_quant( 2025-05-07T20:32:37.0016025Z self, 2025-05-07T20:32:37.0016218Z T: int, 2025-05-07T20:32:37.0016418Z D: int, 2025-05-07T20:32:37.0016638Z scale_ub: Optional[float], 2025-05-07T20:32:37.0016912Z contiguous: bool, 2025-05-07T20:32:37.0017149Z compiled: bool, 2025-05-07T20:32:37.0017377Z ) -> None: 2025-05-07T20:32:37.0017679Z torch.manual_seed(2025) 2025-05-07T20:32:37.0017917Z 2025-05-07T20:32:37.0018186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0018531Z 2025-05-07T20:32:37.0018717Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0019005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0019320Z x = x_sign * x_clamp 2025-05-07T20:32:37.0019552Z x0 = x[:, :D] 2025-05-07T20:32:37.0019767Z x1 = x[:, D:] 2025-05-07T20:32:37.0019977Z 2025-05-07T20:32:37.0020157Z if contiguous: 2025-05-07T20:32:37.0020390Z x0 = x0.contiguous() 2025-05-07T20:32:37.0020654Z x1 = x1.contiguous() 2025-05-07T20:32:37.0020889Z 2025-05-07T20:32:37.0021086Z if scale_ub is not None: 2025-05-07T20:32:37.0021361Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0021691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0022010Z ) 2025-05-07T20:32:37.0022281Z else: 2025-05-07T20:32:37.0022497Z scale_ub_tensor = None 2025-05-07T20:32:37.0022741Z 2025-05-07T20:32:37.0022974Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0023293Z op = silu_mul_quant 2025-05-07T20:32:37.0023537Z if compiled: 2025-05-07T20:32:37.0023787Z op = torch.compile(op) 2025-05-07T20:32:37.0024087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0024358Z 2025-05-07T20:32:37.0024556Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0024721Z 2025-05-07T20:32:37.0024834Z moe/activation_test.py:117: 2025-05-07T20:32:37.0025140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0032339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0032645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0033361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0034071Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0034674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0035371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0036038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0036577Z kernel = self.compile( 2025-05-07T20:32:37.0037135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0037803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0038205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0038447Z 2025-05-07T20:32:37.0038661Z self = 2025-05-07T20:32:37.0039763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0041175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8ca0>} 2025-05-07T20:32:37.0042653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0043699Z context = 2025-05-07T20:32:37.0043997Z 2025-05-07T20:32:37.0044166Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0044755Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0045271Z module_map=module_map) 2025-05-07T20:32:37.0045645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0046012Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0046280Z E ^ 2025-05-07T20:32:37.0046747Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0047214Z 2025-05-07T20:32:37.0047639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0048164Z 2025-05-07T20:32:37.0048281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0048706Z self=, 2025-05-07T20:32:37.0049109Z T=128, 2025-05-07T20:32:37.0049298Z D=5120, 2025-05-07T20:32:37.0049500Z scale_ub=None, 2025-05-07T20:32:37.0049761Z contiguous=False, 2025-05-07T20:32:37.0049996Z compiled=False, 2025-05-07T20:32:37.0050210Z ) 2025-05-07T20:32:37.0050529Z self = 2025-05-07T20:32:37.0051050Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.0051330Z 2025-05-07T20:32:37.0051416Z @given( 2025-05-07T20:32:37.0051654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0051970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0052285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0052626Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0052956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0053249Z ) 2025-05-07T20:32:37.0053604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0054051Z def test_silu_mul_quant( 2025-05-07T20:32:37.0054295Z self, 2025-05-07T20:32:37.0054501Z T: int, 2025-05-07T20:32:37.0054701Z D: int, 2025-05-07T20:32:37.0054920Z scale_ub: Optional[float], 2025-05-07T20:32:37.0055207Z contiguous: bool, 2025-05-07T20:32:37.0055452Z compiled: bool, 2025-05-07T20:32:37.0055675Z ) -> None: 2025-05-07T20:32:37.0055895Z torch.manual_seed(2025) 2025-05-07T20:32:37.0056150Z 2025-05-07T20:32:37.0056423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0056779Z 2025-05-07T20:32:37.0056981Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0057276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0057593Z x = x_sign * x_clamp 2025-05-07T20:32:37.0057832Z x0 = x[:, :D] 2025-05-07T20:32:37.0058048Z x1 = x[:, D:] 2025-05-07T20:32:37.0058256Z 2025-05-07T20:32:37.0058436Z if contiguous: 2025-05-07T20:32:37.0058672Z x0 = x0.contiguous() 2025-05-07T20:32:37.0058933Z x1 = x1.contiguous() 2025-05-07T20:32:37.0059176Z 2025-05-07T20:32:37.0059370Z if scale_ub is not None: 2025-05-07T20:32:37.0059641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0059979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0060295Z ) 2025-05-07T20:32:37.0060483Z else: 2025-05-07T20:32:37.0060746Z scale_ub_tensor = None 2025-05-07T20:32:37.0061003Z 2025-05-07T20:32:37.0061237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0061556Z op = silu_mul_quant 2025-05-07T20:32:37.0061812Z if compiled: 2025-05-07T20:32:37.0062129Z op = torch.compile(op) 2025-05-07T20:32:37.0062428Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0062709Z 2025-05-07T20:32:37.0062905Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0063070Z 2025-05-07T20:32:37.0063173Z moe/activation_test.py:117: 2025-05-07T20:32:37.0063519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0063860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0064141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0064902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0065612Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
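The failing type is an architecture limitation rather than bad test data: Triton's
fp8e4nv element type (the one backing torch.float8_e4m3fn) is implemented only for
NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper), and on older parts this
Triton build offers only 'fp8e4b15' and 'fp8e5', exactly as the ValueError reports.
A minimal sketch of a capability guard that would skip the test on such devices (the
helper and class names here are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs compute capability >= 8.9; e.g. an A10G reports (8, 6)
        # and would be skipped, while an H100 reports (9, 0) and would run.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...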
[The same _fbgemm_silu_mul_quant traceback and CompilationError repeat for the next
examples; the duplicated test source and Triton frames are elided.]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

For this example fn() completed and the failure surfaced in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
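Note that the reference path hits the same wall: triton_quantize_fp8_row launches a
Triton kernel of its own (_kernel_quantize_fp8_row), so on an unsupported GPU the test
cannot even build its oracle. As a rough picture of what that reference computes, a
plain-PyTorch sketch of row-wise fp8 quantization follows; the 448.0 bound is the
finite maximum of torch.float8_e4m3fn, and the clamping details are an assumption,
not a copy of FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, sized so the row's largest |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # The test dequantizes with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale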
The remaining examples all failed in the forward path again, with the identical
CompilationError from _fbgemm_silu_mul_quant (duplicated source and tracebacks elided):

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
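For triage, the error reproduces without FBGEMM or Hypothesis at all. A minimal
standalone sketch (assuming a CUDA build of Triton) that should raise the same
ValueError at kernel-compile time on a pre-SM 8.9 device:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # Materializing an fp8e4nv value is the operation the backend rejects.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)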
2025-05-07T20:32:38.5844879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:38.5845655Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:38.5846381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:38.5847066Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:38.5847721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:38.5848259Z     kernel = self.compile(
2025-05-07T20:32:38.5848803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:38.5849454Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:38.5849857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:38.5850091Z 
2025-05-07T20:32:38.5850299Z self = 
2025-05-07T20:32:38.5851393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:38.5852804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f9fc86f5ee0>}
2025-05-07T20:32:38.5854151Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:38.5855182Z context = 
2025-05-07T20:32:38.5855476Z 
2025-05-07T20:32:38.5855642Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:38.5856167Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:38.5856637Z                            module_map=module_map)
2025-05-07T20:32:38.5857009Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:38.5857369Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:38.5857624Z E       ^
2025-05-07T20:32:38.5858098Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:38.5858614Z 
2025-05-07T20:32:38.5859031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:38.5859542Z 
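Every example Hypothesis draws dies in the same kernel compile, so one plausible test-side remedy is to skip rather than fail on pre-SM 8.9 devices. A sketch along those lines, assuming a unittest-style test class; the class name and decorator placement are illustrative, not the repository's actual code:

import unittest

import torch

_HAS_NATIVE_FP8 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

class ActivationTests(unittest.TestCase):  # class name assumed from the trace
    @unittest.skipIf(
        not _HAS_NATIVE_FP8,
        "Triton fp8e4nv requires compute capability >= 8.9 (Ada/Hopper)",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # Hypothesis-driven body as shown in the log above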
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5858614Z 2025-05-07T20:32:38.5859031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5859542Z 2025-05-07T20:32:38.5859654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.5860118Z self=, 2025-05-07T20:32:38.5860541Z T=4096, 2025-05-07T20:32:38.5860738Z D=5120, 2025-05-07T20:32:38.5860932Z scale_ub=None, 2025-05-07T20:32:38.5861156Z contiguous=False, 2025-05-07T20:32:38.5861390Z compiled=True, 2025-05-07T20:32:38.5861643Z ) 2025-05-07T20:32:38.5861968Z self = 2025-05-07T20:32:38.5862472Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:38.5862741Z 2025-05-07T20:32:38.5862823Z @given( 2025-05-07T20:32:38.5863052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.5863373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.5863690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.5864014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.5864357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.5864653Z ) 2025-05-07T20:32:38.5865001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.5865444Z def test_silu_mul_quant( 2025-05-07T20:32:38.5865689Z self, 2025-05-07T20:32:38.5865885Z T: int, 2025-05-07T20:32:38.5866640Z D: int, 2025-05-07T20:32:38.5866865Z scale_ub: Optional[float], 2025-05-07T20:32:38.5867139Z contiguous: bool, 2025-05-07T20:32:38.5867381Z compiled: bool, 2025-05-07T20:32:38.5867606Z ) -> None: 2025-05-07T20:32:38.5867827Z torch.manual_seed(2025) 2025-05-07T20:32:38.5868064Z 2025-05-07T20:32:38.5868348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.5868706Z 2025-05-07T20:32:38.5868897Z x_sign = torch.sign(x) 2025-05-07T20:32:38.5869189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.5869506Z x = x_sign * x_clamp 2025-05-07T20:32:38.5869745Z x0 = x[:, :D] 2025-05-07T20:32:38.5870039Z x1 = x[:, D:] 2025-05-07T20:32:38.5870248Z 2025-05-07T20:32:38.5870428Z if contiguous: 2025-05-07T20:32:38.5870662Z x0 = x0.contiguous() 2025-05-07T20:32:38.5870922Z x1 = x1.contiguous() 2025-05-07T20:32:38.5871158Z 2025-05-07T20:32:38.5871358Z if scale_ub is not None: 2025-05-07T20:32:38.5871632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.5871965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.5872274Z ) 2025-05-07T20:32:38.5872470Z else: 2025-05-07T20:32:38.5872682Z scale_ub_tensor = None 2025-05-07T20:32:38.5872934Z 2025-05-07T20:32:38.5873172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.5873488Z op = silu_mul_quant 2025-05-07T20:32:38.5873741Z if compiled: 2025-05-07T20:32:38.5873990Z op = torch.compile(op) 2025-05-07T20:32:38.5874291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5874563Z 2025-05-07T20:32:38.5874756Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.5874920Z 2025-05-07T20:32:38.5875028Z moe/activation_test.py:117: 2025-05-07T20:32:38.5875318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5875655Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.5875936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.5876494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.5877046Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.5877714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.5878455Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.5879021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.5879706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.5880367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.5880899Z kernel = self.compile( 2025-05-07T20:32:38.5881478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.5882131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.5882532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.5882758Z 2025-05-07T20:32:38.5882974Z self = 2025-05-07T20:32:38.5884072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.5885458Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ebe940>} 2025-05-07T20:32:38.5886853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.5887883Z context = 2025-05-07T20:32:38.5888169Z 2025-05-07T20:32:38.5888335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.5888870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.5889341Z module_map=module_map) 2025-05-07T20:32:38.5889729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.5890080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.5890345Z E ^ 2025-05-07T20:32:38.5890808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.5898626Z 2025-05-07T20:32:38.5899103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.5899631Z 2025-05-07T20:32:38.7848716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.7849347Z self=, 2025-05-07T20:32:38.7850108Z T=4096, 2025-05-07T20:32:38.7850371Z D=5120, 2025-05-07T20:32:38.7850653Z scale_ub=1200.0, 2025-05-07T20:32:38.7850913Z contiguous=False, 2025-05-07T20:32:38.7851136Z compiled=False, 2025-05-07T20:32:38.7851336Z ) 2025-05-07T20:32:38.7851661Z self = 2025-05-07T20:32:38.7852165Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:38.7852443Z 2025-05-07T20:32:38.7852520Z @given( 2025-05-07T20:32:38.7852750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.7853064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.7853375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.7853710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.7854039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.7854335Z ) 2025-05-07T20:32:38.7854687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.7855421Z def test_silu_mul_quant( 2025-05-07T20:32:38.7855663Z self, 2025-05-07T20:32:38.7855850Z T: int, 2025-05-07T20:32:38.7856055Z D: int, 2025-05-07T20:32:38.7856277Z scale_ub: Optional[float], 2025-05-07T20:32:38.7856551Z contiguous: bool, 2025-05-07T20:32:38.7856876Z compiled: bool, 2025-05-07T20:32:38.7857117Z ) -> None: 2025-05-07T20:32:38.7857328Z torch.manual_seed(2025) 2025-05-07T20:32:38.7857573Z 2025-05-07T20:32:38.7857851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.7858189Z 2025-05-07T20:32:38.7858465Z x_sign = torch.sign(x) 2025-05-07T20:32:38.7858760Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.7859064Z x = x_sign * x_clamp 2025-05-07T20:32:38.7859307Z x0 = x[:, :D] 2025-05-07T20:32:38.7859527Z x1 = x[:, D:] 2025-05-07T20:32:38.7859743Z 2025-05-07T20:32:38.7859924Z if contiguous: 2025-05-07T20:32:38.7860160Z x0 = x0.contiguous() 2025-05-07T20:32:38.7860419Z x1 = x1.contiguous() 2025-05-07T20:32:38.7860650Z 2025-05-07T20:32:38.7860843Z if scale_ub is not None: 2025-05-07T20:32:38.7861124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.7861460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.7861767Z ) 2025-05-07T20:32:38.7861964Z else: 2025-05-07T20:32:38.7862173Z scale_ub_tensor = None 2025-05-07T20:32:38.7862433Z 2025-05-07T20:32:38.7862697Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.7863096Z op = silu_mul_quant 2025-05-07T20:32:38.7863350Z if compiled: 2025-05-07T20:32:38.7863605Z op = torch.compile(op) 2025-05-07T20:32:38.7863913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7864187Z 2025-05-07T20:32:38.7864383Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.7864561Z 2025-05-07T20:32:38.7864668Z moe/activation_test.py:117: 2025-05-07T20:32:38.7864974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7865357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.7865642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7866337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.7867027Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.7867563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.7868253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.7868926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.7869455Z kernel = self.compile( 2025-05-07T20:32:38.7870094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.7870756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.7871163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7871392Z 2025-05-07T20:32:38.7871599Z self = 2025-05-07T20:32:38.7872701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.7874102Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc892d3a0>} 2025-05-07T20:32:38.7875456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.7876536Z context = 2025-05-07T20:32:38.7876825Z 2025-05-07T20:32:38.7877030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.7877555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.7878018Z module_map=module_map) 2025-05-07T20:32:38.7878379Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.7878776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.7879035Z E ^ 2025-05-07T20:32:38.7879499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.7879963Z 2025-05-07T20:32:38.7880382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7880904Z 2025-05-07T20:32:38.7881006Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.7881425Z self=, 2025-05-07T20:32:38.7881825Z T=4096, 2025-05-07T20:32:38.7882015Z D=5120, 2025-05-07T20:32:38.7882207Z scale_ub=1200.0, 2025-05-07T20:32:38.7882423Z contiguous=False, 2025-05-07T20:32:38.7882650Z compiled=True, 2025-05-07T20:32:38.7882856Z ) 2025-05-07T20:32:38.7883172Z self = 2025-05-07T20:32:38.7883719Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:38.7884001Z 2025-05-07T20:32:38.7884073Z @given( 2025-05-07T20:32:38.7884305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.7884606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.7884967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.7885310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.7885632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.7885917Z ) 2025-05-07T20:32:38.7886267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.7886705Z def test_silu_mul_quant( 2025-05-07T20:32:38.7886941Z self, 2025-05-07T20:32:38.7887131Z T: int, 2025-05-07T20:32:38.7887326Z D: int, 2025-05-07T20:32:38.7887533Z scale_ub: Optional[float], 2025-05-07T20:32:38.7887809Z contiguous: bool, 2025-05-07T20:32:38.7888056Z compiled: bool, 2025-05-07T20:32:38.7888273Z ) -> None: 2025-05-07T20:32:38.7888496Z torch.manual_seed(2025) 2025-05-07T20:32:38.7888748Z 2025-05-07T20:32:38.7889017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.7889369Z 2025-05-07T20:32:38.7889560Z x_sign = torch.sign(x) 2025-05-07T20:32:38.7889848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.7890167Z x = x_sign * x_clamp 2025-05-07T20:32:38.7890419Z x0 = x[:, :D] 2025-05-07T20:32:38.7890635Z x1 = x[:, D:] 2025-05-07T20:32:38.7890848Z 2025-05-07T20:32:38.7891041Z if contiguous: 2025-05-07T20:32:38.7891273Z x0 = x0.contiguous() 2025-05-07T20:32:38.7891535Z x1 = x1.contiguous() 2025-05-07T20:32:38.7891779Z 2025-05-07T20:32:38.7891965Z if scale_ub is not None: 2025-05-07T20:32:38.7892236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.7892577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.7892888Z ) 2025-05-07T20:32:38.7893072Z else: 2025-05-07T20:32:38.7893285Z scale_ub_tensor = None 2025-05-07T20:32:38.7893535Z 2025-05-07T20:32:38.7893757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.7894071Z op = silu_mul_quant 2025-05-07T20:32:38.7894375Z if compiled: 2025-05-07T20:32:38.7894619Z op = torch.compile(op) 2025-05-07T20:32:38.7894920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7895215Z 2025-05-07T20:32:38.7895462Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.7895633Z 2025-05-07T20:32:38.7895734Z moe/activation_test.py:117: 2025-05-07T20:32:38.7896025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7896353Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.7896626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.7897220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.7897775Z return fn(*args, **kwargs) 
2025-05-07T20:32:38.7898424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.7899112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.7899640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.7900313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.7900966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.7901492Z kernel = self.compile( 2025-05-07T20:32:38.7902024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.7902709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.7903100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.7903331Z 2025-05-07T20:32:38.7903535Z self = 2025-05-07T20:32:38.7904891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.7906275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc892d280>} 2025-05-07T20:32:38.7907624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.7908662Z context = 2025-05-07T20:32:38.7908952Z 2025-05-07T20:32:38.7909118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.7909647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.7910159Z module_map=module_map) 2025-05-07T20:32:38.7910523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.7910877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.7911129Z E ^ 2025-05-07T20:32:38.7911595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.7912051Z 2025-05-07T20:32:38.7912469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7912984Z 2025-05-07T20:32:39.0676130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0676688Z self=, 2025-05-07T20:32:39.0677165Z T=2048, 2025-05-07T20:32:39.0677353Z D=7168, 2025-05-07T20:32:39.0677545Z scale_ub=1200.0, 2025-05-07T20:32:39.0677767Z contiguous=False, 2025-05-07T20:32:39.0677993Z compiled=False, 2025-05-07T20:32:39.0678441Z ) 2025-05-07T20:32:39.0678751Z self = 2025-05-07T20:32:39.0679247Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.0679545Z 2025-05-07T20:32:39.0679713Z @given( 2025-05-07T20:32:39.0679946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.0680262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.0680565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.0680900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.0681316Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.0681597Z ) 2025-05-07T20:32:39.0681950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.0682391Z def test_silu_mul_quant( 2025-05-07T20:32:39.0682636Z self, 2025-05-07T20:32:39.0682827Z T: int, 2025-05-07T20:32:39.0683031Z D: int, 2025-05-07T20:32:39.0683250Z scale_ub: Optional[float], 2025-05-07T20:32:39.0683517Z contiguous: bool, 2025-05-07T20:32:39.0683759Z compiled: bool, 2025-05-07T20:32:39.0683989Z ) -> None: 2025-05-07T20:32:39.0684205Z torch.manual_seed(2025) 2025-05-07T20:32:39.0684446Z 2025-05-07T20:32:39.0684719Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.0685084Z 2025-05-07T20:32:39.0685310Z x_sign = torch.sign(x) 2025-05-07T20:32:39.0685601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.0685989Z x = x_sign * x_clamp 2025-05-07T20:32:39.0686234Z x0 = x[:, :D] 2025-05-07T20:32:39.0686452Z x1 = x[:, D:] 2025-05-07T20:32:39.0686653Z 2025-05-07T20:32:39.0686840Z if contiguous: 2025-05-07T20:32:39.0687074Z x0 = x0.contiguous() 2025-05-07T20:32:39.0687328Z x1 = x1.contiguous() 2025-05-07T20:32:39.0687572Z 2025-05-07T20:32:39.0687763Z if scale_ub is not None: 2025-05-07T20:32:39.0688032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.0688376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.0688683Z ) 2025-05-07T20:32:39.0688879Z else: 2025-05-07T20:32:39.0689085Z scale_ub_tensor = None 2025-05-07T20:32:39.0689336Z 2025-05-07T20:32:39.0689572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.0689886Z op = silu_mul_quant 2025-05-07T20:32:39.0690139Z if compiled: 2025-05-07T20:32:39.0690395Z op = torch.compile(op) 2025-05-07T20:32:39.0690690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0690968Z 2025-05-07T20:32:39.0691166Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.0691330Z 2025-05-07T20:32:39.0691432Z moe/activation_test.py:117: 2025-05-07T20:32:39.0691736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0692070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.0692358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0693055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.0693765Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.0694306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.0694981Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.0695650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.0696184Z kernel = self.compile( 2025-05-07T20:32:39.0696725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.0697369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.0697858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0698084Z 2025-05-07T20:32:39.0698298Z self = 2025-05-07T20:32:39.0699434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.0701079Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8a0e670>} 2025-05-07T20:32:39.0702489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0703518Z context = 2025-05-07T20:32:39.0704086Z 2025-05-07T20:32:39.0704255Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0704780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0705292Z module_map=module_map) 2025-05-07T20:32:39.0705667Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0706027Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.0706278Z E ^ 2025-05-07T20:32:39.0706819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.0707380Z 2025-05-07T20:32:39.0707883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.0708509Z 2025-05-07T20:32:39.0708627Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0709128Z self=, 2025-05-07T20:32:39.0709588Z T=1, 2025-05-07T20:32:39.0709780Z D=7168, 2025-05-07T20:32:39.0710087Z scale_ub=None, 2025-05-07T20:32:39.0710296Z contiguous=True, 2025-05-07T20:32:39.0710522Z compiled=False, 2025-05-07T20:32:39.0710727Z ) 2025-05-07T20:32:39.0711039Z self = 2025-05-07T20:32:39.0711524Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:39.0711787Z 2025-05-07T20:32:39.0711869Z @given( 2025-05-07T20:32:39.0712107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.0712414Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.0712721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.0713053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.0713377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.0713666Z ) 2025-05-07T20:32:39.0714015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.0714447Z def test_silu_mul_quant( 2025-05-07T20:32:39.0714691Z self, 2025-05-07T20:32:39.0714887Z T: int, 2025-05-07T20:32:39.0715083Z D: int, 2025-05-07T20:32:39.0715345Z scale_ub: Optional[float], 2025-05-07T20:32:39.0715627Z contiguous: bool, 2025-05-07T20:32:39.0715872Z compiled: bool, 2025-05-07T20:32:39.0716093Z ) -> None: 2025-05-07T20:32:39.0716310Z torch.manual_seed(2025) 2025-05-07T20:32:39.0716561Z 2025-05-07T20:32:39.0716832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.0717175Z 2025-05-07T20:32:39.0717371Z x_sign = torch.sign(x) 2025-05-07T20:32:39.0717661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.0717979Z x = x_sign * x_clamp 2025-05-07T20:32:39.0718298Z x0 = x[:, :D] 2025-05-07T20:32:39.0718511Z x1 = x[:, D:] 2025-05-07T20:32:39.0718722Z 2025-05-07T20:32:39.0718912Z if contiguous: 2025-05-07T20:32:39.0719138Z x0 = x0.contiguous() 2025-05-07T20:32:39.0719398Z x1 = x1.contiguous() 2025-05-07T20:32:39.0719711Z 2025-05-07T20:32:39.0719901Z if scale_ub is not None: 2025-05-07T20:32:39.0720183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.0720518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.0720830Z ) 2025-05-07T20:32:39.0721082Z else: 2025-05-07T20:32:39.0721299Z scale_ub_tensor = None 2025-05-07T20:32:39.0721556Z 2025-05-07T20:32:39.0721784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.0722100Z op = silu_mul_quant 2025-05-07T20:32:39.0722352Z if compiled: 2025-05-07T20:32:39.0722595Z op = torch.compile(op) 2025-05-07T20:32:39.0722897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0723176Z 2025-05-07T20:32:39.0723367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.0723542Z 2025-05-07T20:32:39.0723640Z moe/activation_test.py:117: 2025-05-07T20:32:39.0723941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0724267Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.0724556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.0725300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.0726000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.0726534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.0727211Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.0727872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.0728406Z kernel = self.compile( 2025-05-07T20:32:39.0728937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.0729590Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.0729989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.0730217Z 2025-05-07T20:32:39.0730428Z self = 2025-05-07T20:32:39.0731520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.0732904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8962280>} 2025-05-07T20:32:39.0734252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0735327Z context = 2025-05-07T20:32:39.0735618Z 2025-05-07T20:32:39.0735784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0736313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0736795Z module_map=module_map) 2025-05-07T20:32:39.0737175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0737526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.0737794Z E ^ 2025-05-07T20:32:39.0738274Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.0738774Z 2025-05-07T20:32:39.0739186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.0739702Z 2025-05-07T20:32:39.0739850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.0740272Z self=, 2025-05-07T20:32:39.0740679Z T=16384, 2025-05-07T20:32:39.0740867Z D=7168, 2025-05-07T20:32:39.0741070Z scale_ub=1200.0, 2025-05-07T20:32:39.0741296Z contiguous=False, 2025-05-07T20:32:39.0741561Z compiled=True, 2025-05-07T20:32:39.0741770Z ) 2025-05-07T20:32:39.2661769Z self = 2025-05-07T20:32:39.2663082Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.2663646Z 2025-05-07T20:32:39.2663811Z @given( 2025-05-07T20:32:39.2664300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2664923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2665290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2665638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2665976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2666266Z ) 2025-05-07T20:32:39.2666618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2667057Z def test_silu_mul_quant( 2025-05-07T20:32:39.2667303Z self, 2025-05-07T20:32:39.2667745Z T: int, 2025-05-07T20:32:39.2667946Z D: int, 2025-05-07T20:32:39.2668176Z scale_ub: Optional[float], 2025-05-07T20:32:39.2668454Z contiguous: bool, 2025-05-07T20:32:39.2668691Z compiled: bool, 2025-05-07T20:32:39.2668920Z ) -> None: 2025-05-07T20:32:39.2669143Z torch.manual_seed(2025) 2025-05-07T20:32:39.2669383Z 2025-05-07T20:32:39.2669670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2670120Z 2025-05-07T20:32:39.2670313Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2670608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2670926Z x = x_sign * x_clamp 2025-05-07T20:32:39.2671167Z x0 = x[:, :D] 2025-05-07T20:32:39.2671390Z x1 = x[:, D:] 2025-05-07T20:32:39.2671608Z 2025-05-07T20:32:39.2671792Z if contiguous: 2025-05-07T20:32:39.2672045Z x0 = x0.contiguous() 2025-05-07T20:32:39.2672315Z x1 = x1.contiguous() 2025-05-07T20:32:39.2672569Z 2025-05-07T20:32:39.2672766Z if scale_ub is not None: 2025-05-07T20:32:39.2681110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2681506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2681828Z ) 2025-05-07T20:32:39.2682019Z else: 2025-05-07T20:32:39.2682242Z scale_ub_tensor = None 2025-05-07T20:32:39.2682512Z 2025-05-07T20:32:39.2682746Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2683069Z op = silu_mul_quant 2025-05-07T20:32:39.2683329Z if compiled: 2025-05-07T20:32:39.2683593Z op = torch.compile(op) 2025-05-07T20:32:39.2683897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2684184Z 2025-05-07T20:32:39.2684392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2684560Z 2025-05-07T20:32:39.2684665Z moe/activation_test.py:117: 2025-05-07T20:32:39.2685012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2685365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2685654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2686232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.2686790Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.2687602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2688297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2688923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2689618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2690284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2690914Z kernel = self.compile( 2025-05-07T20:32:39.2691468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2692132Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2692538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2692782Z 2025-05-07T20:32:39.2692993Z self = 2025-05-07T20:32:39.2694091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2695543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8962ee0>} 2025-05-07T20:32:39.2696938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2697974Z context = 2025-05-07T20:32:39.2698272Z 2025-05-07T20:32:39.2698445Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2698979Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2699450Z module_map=module_map) 2025-05-07T20:32:39.2699830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2700203Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2700474Z E ^ 2025-05-07T20:32:39.2700945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2701414Z 2025-05-07T20:32:39.2701837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2702351Z 2025-05-07T20:32:39.2702469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2702896Z self=, 2025-05-07T20:32:39.2703311Z T=1, 2025-05-07T20:32:39.2703505Z D=7168, 2025-05-07T20:32:39.2704057Z scale_ub=None, 2025-05-07T20:32:39.2704315Z contiguous=False, 2025-05-07T20:32:39.2704551Z compiled=False, 2025-05-07T20:32:39.2704770Z ) 2025-05-07T20:32:39.2705134Z self = 2025-05-07T20:32:39.2705640Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:39.2705903Z 2025-05-07T20:32:39.2705989Z @given( 2025-05-07T20:32:39.2706219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.2706540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.2706857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.2707195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.2707523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.2707817Z ) 2025-05-07T20:32:39.2708171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.2708712Z def test_silu_mul_quant( 2025-05-07T20:32:39.2708960Z self, 2025-05-07T20:32:39.2709162Z T: int, 2025-05-07T20:32:39.2709356Z D: int, 2025-05-07T20:32:39.2709576Z scale_ub: Optional[float], 2025-05-07T20:32:39.2709996Z contiguous: bool, 2025-05-07T20:32:39.2710244Z compiled: bool, 2025-05-07T20:32:39.2710465Z ) -> None: 2025-05-07T20:32:39.2710685Z torch.manual_seed(2025) 2025-05-07T20:32:39.2710931Z 2025-05-07T20:32:39.2711200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.2711616Z 2025-05-07T20:32:39.2711814Z x_sign = torch.sign(x) 2025-05-07T20:32:39.2712106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.2712423Z x = x_sign * x_clamp 2025-05-07T20:32:39.2712666Z x0 = x[:, :D] 2025-05-07T20:32:39.2712880Z x1 = x[:, D:] 2025-05-07T20:32:39.2713093Z 2025-05-07T20:32:39.2713286Z if contiguous: 2025-05-07T20:32:39.2713512Z x0 = x0.contiguous() 2025-05-07T20:32:39.2713779Z x1 = x1.contiguous() 2025-05-07T20:32:39.2714021Z 2025-05-07T20:32:39.2714209Z if scale_ub is not None: 2025-05-07T20:32:39.2714491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.2714850Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.2715188Z ) 2025-05-07T20:32:39.2715376Z else: 2025-05-07T20:32:39.2715586Z scale_ub_tensor = None 2025-05-07T20:32:39.2715840Z 2025-05-07T20:32:39.2716138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.2716450Z op = silu_mul_quant 2025-05-07T20:32:39.2716697Z if compiled: 2025-05-07T20:32:39.2716933Z op = torch.compile(op) 2025-05-07T20:32:39.2717224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2717493Z 2025-05-07T20:32:39.2717681Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.2717851Z 2025-05-07T20:32:39.2717947Z moe/activation_test.py:117: 2025-05-07T20:32:39.2718239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2718561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.2718840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.2719528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.2720222Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.2720758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.2721437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.2722093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.2722635Z kernel = self.compile( 2025-05-07T20:32:39.2723169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.2723823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.2724216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.2724439Z 2025-05-07T20:32:39.2724644Z self = 2025-05-07T20:32:39.2725783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.2727166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc887b670>} 2025-05-07T20:32:39.2728512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.2729591Z context = 2025-05-07T20:32:39.2729880Z 2025-05-07T20:32:39.2730090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.2730620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.2731093Z module_map=module_map) 2025-05-07T20:32:39.2731515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.2731874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.2732138Z E ^ 2025-05-07T20:32:39.2732611Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.2733064Z 2025-05-07T20:32:39.2733479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.2734002Z 2025-05-07T20:32:39.2734110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.2734529Z self=, 2025-05-07T20:32:39.2734939Z T=2048, 2025-05-07T20:32:39.2735129Z D=7168, 2025-05-07T20:32:39.2735316Z scale_ub=None, 2025-05-07T20:32:39.2735532Z contiguous=False, 2025-05-07T20:32:39.2735751Z compiled=True, 2025-05-07T20:32:39.2735956Z ) 2025-05-07T20:32:39.5599255Z self = 2025-05-07T20:32:39.5599821Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.5600092Z 2025-05-07T20:32:39.5600166Z @given( 2025-05-07T20:32:39.5600396Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.5600707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.5601010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.5601341Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.5601667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.5601941Z ) 2025-05-07T20:32:39.5602293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.5602731Z def test_silu_mul_quant( 2025-05-07T20:32:39.5602970Z self, 2025-05-07T20:32:39.5603155Z T: int, 2025-05-07T20:32:39.5603349Z D: int, 2025-05-07T20:32:39.5603566Z scale_ub: Optional[float], 2025-05-07T20:32:39.5604085Z contiguous: bool, 2025-05-07T20:32:39.5604320Z compiled: bool, 2025-05-07T20:32:39.5604543Z ) -> None: 2025-05-07T20:32:39.5604755Z torch.manual_seed(2025) 2025-05-07T20:32:39.5605000Z 2025-05-07T20:32:39.5605311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.5605655Z 2025-05-07T20:32:39.5605858Z x_sign = torch.sign(x) 2025-05-07T20:32:39.5606151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.5606451Z x = x_sign * x_clamp 2025-05-07T20:32:39.5606694Z x0 = x[:, :D] 2025-05-07T20:32:39.5606907Z x1 = x[:, D:] 2025-05-07T20:32:39.5607110Z 2025-05-07T20:32:39.5607294Z if contiguous: 2025-05-07T20:32:39.5607533Z x0 = x0.contiguous() 2025-05-07T20:32:39.5607783Z x1 = x1.contiguous() 2025-05-07T20:32:39.5608027Z 2025-05-07T20:32:39.5608219Z if scale_ub is not None: 2025-05-07T20:32:39.5608490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.5608823Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.5609138Z ) 2025-05-07T20:32:39.5609332Z else: 2025-05-07T20:32:39.5609545Z scale_ub_tensor = None 2025-05-07T20:32:39.5609796Z 2025-05-07T20:32:39.5610031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.5610427Z op = silu_mul_quant 2025-05-07T20:32:39.5610684Z if compiled: 2025-05-07T20:32:39.5610927Z op = torch.compile(op) 2025-05-07T20:32:39.5611216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5611483Z 2025-05-07T20:32:39.5611751Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.5611914Z 2025-05-07T20:32:39.5612012Z moe/activation_test.py:117: 2025-05-07T20:32:39.5612312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5612644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.5613045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5613618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.5614181Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.5614845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.5615538Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.5616077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.5616767Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.5617419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.5617947Z kernel = self.compile( 2025-05-07T20:32:39.5618545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.5619199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.5619595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5619827Z 2025-05-07T20:32:39.5620033Z self = 2025-05-07T20:32:39.5621123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.5622524Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc845c550>} 2025-05-07T20:32:39.5623866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.5624900Z context = 2025-05-07T20:32:39.5625235Z 2025-05-07T20:32:39.5625406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.5625929Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.5626394Z module_map=module_map) 2025-05-07T20:32:39.5626764Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.5627109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.5627368Z E ^ 2025-05-07T20:32:39.5627825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.5628283Z 2025-05-07T20:32:39.5628706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.5629221Z 2025-05-07T20:32:39.5629325Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.5629742Z self=, 2025-05-07T20:32:39.5630193Z T=4096, 2025-05-07T20:32:39.5630379Z D=7168, 2025-05-07T20:32:39.5630567Z scale_ub=None, 2025-05-07T20:32:39.5630773Z contiguous=False, 2025-05-07T20:32:39.5631067Z compiled=True, 2025-05-07T20:32:39.5631274Z ) 2025-05-07T20:32:39.5631588Z self = 2025-05-07T20:32:39.5632080Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:39.5632394Z 2025-05-07T20:32:39.5632477Z @given( 2025-05-07T20:32:39.5632705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.5633016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.5633319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.5633699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.5634026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.5634313Z ) 2025-05-07T20:32:39.5634666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.5635104Z def test_silu_mul_quant( 2025-05-07T20:32:39.5635376Z self, 2025-05-07T20:32:39.5635594Z T: int, 2025-05-07T20:32:39.5635780Z D: int, 2025-05-07T20:32:39.5635995Z scale_ub: Optional[float], 2025-05-07T20:32:39.5636269Z contiguous: bool, 2025-05-07T20:32:39.5636504Z compiled: bool, 2025-05-07T20:32:39.5636738Z ) -> None: 2025-05-07T20:32:39.5636957Z torch.manual_seed(2025) 2025-05-07T20:32:39.5637194Z 2025-05-07T20:32:39.5637464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.5637808Z 2025-05-07T20:32:39.5638000Z x_sign = torch.sign(x) 2025-05-07T20:32:39.5638342Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.5638664Z x = x_sign * x_clamp 2025-05-07T20:32:39.5638904Z x0 = x[:, :D] 2025-05-07T20:32:39.5639113Z x1 = x[:, D:] 2025-05-07T20:32:39.5639321Z 2025-05-07T20:32:39.5639506Z if contiguous: 2025-05-07T20:32:39.5639728Z x0 = x0.contiguous() 2025-05-07T20:32:39.5639990Z x1 = x1.contiguous() 2025-05-07T20:32:39.5640244Z 2025-05-07T20:32:39.5640431Z if scale_ub is not None: 2025-05-07T20:32:39.5640716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.5641063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.5641368Z ) 2025-05-07T20:32:39.5641563Z else: 2025-05-07T20:32:39.5641776Z scale_ub_tensor = None 2025-05-07T20:32:39.5642016Z 2025-05-07T20:32:39.5642245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.5642557Z op = silu_mul_quant 2025-05-07T20:32:39.5642803Z if compiled: 2025-05-07T20:32:39.5643050Z op = torch.compile(op) 2025-05-07T20:32:39.5643344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5643623Z 2025-05-07T20:32:39.5643803Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.5643972Z 2025-05-07T20:32:39.5644073Z moe/activation_test.py:117: 2025-05-07T20:32:39.5644372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5644696Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.5644977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.5645548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.5646099Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.5646758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.5647444Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.5647981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.5648652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.5649313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.5649891Z kernel = self.compile( 2025-05-07T20:32:39.5650421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.5651070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.5651504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.5651730Z 2025-05-07T20:32:39.5651940Z self = 2025-05-07T20:32:39.5653027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.5654467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8550160>} 2025-05-07T20:32:39.5655878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.5656911Z context = 2025-05-07T20:32:39.5657196Z 2025-05-07T20:32:39.5657371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.5657897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.5658411Z module_map=module_map) 2025-05-07T20:32:39.5658787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.5659133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.5659393Z E ^ 2025-05-07T20:32:39.5659856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.5660313Z 2025-05-07T20:32:39.5660742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.5661258Z 2025-05-07T20:32:39.7731328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7732185Z self=, 2025-05-07T20:32:39.7733237Z T=16384, 2025-05-07T20:32:39.7733717Z D=5120, 2025-05-07T20:32:39.7734148Z scale_ub=1200.0, 2025-05-07T20:32:39.7734578Z contiguous=False, 2025-05-07T20:32:39.7735021Z compiled=False, 2025-05-07T20:32:39.7735354Z ) 2025-05-07T20:32:39.7735733Z self = 2025-05-07T20:32:39.7736232Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.7736525Z 2025-05-07T20:32:39.7736604Z @given( 2025-05-07T20:32:39.7736840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7737151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7737460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7737789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7738110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7738398Z ) 2025-05-07T20:32:39.7738748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7739180Z def test_silu_mul_quant( 2025-05-07T20:32:39.7739427Z self, 2025-05-07T20:32:39.7739624Z T: int, 2025-05-07T20:32:39.7739823Z D: int, 2025-05-07T20:32:39.7740043Z scale_ub: Optional[float], 2025-05-07T20:32:39.7740320Z contiguous: bool, 2025-05-07T20:32:39.7740557Z compiled: bool, 2025-05-07T20:32:39.7740782Z ) -> None: 2025-05-07T20:32:39.7741002Z torch.manual_seed(2025) 2025-05-07T20:32:39.7741244Z 2025-05-07T20:32:39.7741511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7742103Z 2025-05-07T20:32:39.7742295Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7742583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7742894Z x = x_sign * x_clamp 2025-05-07T20:32:39.7743139Z x0 = x[:, :D] 2025-05-07T20:32:39.7743433Z x1 = x[:, D:] 2025-05-07T20:32:39.7743649Z 2025-05-07T20:32:39.7743841Z if contiguous: 2025-05-07T20:32:39.7744069Z x0 = x0.contiguous() 2025-05-07T20:32:39.7744336Z x1 = x1.contiguous() 2025-05-07T20:32:39.7744585Z 2025-05-07T20:32:39.7744777Z if scale_ub is not None: 2025-05-07T20:32:39.7745130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7745472Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7745792Z ) 2025-05-07T20:32:39.7745985Z else: 2025-05-07T20:32:39.7746202Z scale_ub_tensor = None 2025-05-07T20:32:39.7746457Z 2025-05-07T20:32:39.7746690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7747009Z op = silu_mul_quant 2025-05-07T20:32:39.7747265Z if compiled: 2025-05-07T20:32:39.7747507Z op = torch.compile(op) 2025-05-07T20:32:39.7747808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7748085Z 2025-05-07T20:32:39.7748274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.7748444Z 2025-05-07T20:32:39.7748542Z moe/activation_test.py:117: 2025-05-07T20:32:39.7748835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7749243Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.7749525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7750325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.7751020Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.7751551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7752238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7752899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7753433Z kernel = self.compile( 2025-05-07T20:32:39.7753968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7754625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7755035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7755265Z 2025-05-07T20:32:39.7755475Z self = 2025-05-07T20:32:39.7756568Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7757977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8550940>} 2025-05-07T20:32:39.7759336Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7760364Z context = 2025-05-07T20:32:39.7760653Z 2025-05-07T20:32:39.7760817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7761340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7761806Z module_map=module_map) 2025-05-07T20:32:39.7762227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7762579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.7762841Z E ^ 2025-05-07T20:32:39.7763383Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7763841Z 2025-05-07T20:32:39.7764259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
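[Why every example fails: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), and its CUDA lowering requires compute capability 8.9 or newer (Ada/Hopper). The GPU on this runner (an NVIDIA A10G) reports compute capability (8, 6), so Triton offers only fp8e4b15 and fp8e5 there, and the kernel fails at compile time for every input shape, independent of T, D, scale_ub, contiguity, or torch.compile. A minimal sketch of a capability guard that would skip, rather than fail, these tests on such runners (supports_fp8e4nv is a hypothetical helper, not part of the test suite):]

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 e4m3 (Triton "fp8e4nv") needs an NVIDIA GPU with compute
        # capability >= 8.9; the A10G on g5 instances reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on a test class like the one shown above:
    @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...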
[log trimmed: Hypothesis went on to try further examples, each reprinting the same test source and failing with the identical traceback (activation.py:80 -> jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir). The next three:

    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)

each ending in:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError]
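[For reference, the op under test, silu_mul_quant(x0, x1, scale_ub), fuses SiLU(x0) * x1 with quantization to FP8 and returns the quantized tensor plus its scale. A rough eager-mode equivalent of the assumed semantics; the rowwise scaling scheme and the float8_e4m3fn target are inferences from the test above, not FBGEMM's exact kernel:]

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then dynamic rowwise FP8 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Bound the dynamic range used to derive the scale.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)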
[log trimmed: two more examples, same failure:

    Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

E   ValueError("type fp8e4nv not supported in this architecture.
E   The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError]
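[Any one failing example can be reproduced without Hypothesis by calling the op directly. A sketch, assuming a CUDA build of fbgemm_gpu with the experimental gen_ai extensions installed; the import path matches the traceback above:]

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    # The third argument is the optional scale upper bound; passing None
    # matches the scale_ub=None examples in this log.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)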
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.5352378Z 2025-05-07T20:32:40.5352806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.5353327Z 2025-05-07T20:32:40.5353438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.5353858Z self=, 2025-05-07T20:32:40.5354285Z T=128, 2025-05-07T20:32:40.5354478Z D=5120, 2025-05-07T20:32:40.5354676Z scale_ub=1200.0, 2025-05-07T20:32:40.5354900Z contiguous=False, 2025-05-07T20:32:40.5355132Z compiled=True, 2025-05-07T20:32:40.5355342Z ) 2025-05-07T20:32:40.6681398Z self = 2025-05-07T20:32:40.6682110Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.6682387Z 2025-05-07T20:32:40.6682466Z @given( 2025-05-07T20:32:40.6682702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6683018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6683341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6683667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6683997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6684284Z ) 2025-05-07T20:32:40.6684631Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6685074Z def test_silu_mul_quant( 2025-05-07T20:32:40.6685322Z self, 2025-05-07T20:32:40.6685545Z T: int, 2025-05-07T20:32:40.6685776Z D: int, 2025-05-07T20:32:40.6686002Z scale_ub: Optional[float], 2025-05-07T20:32:40.6686277Z contiguous: bool, 2025-05-07T20:32:40.6686525Z compiled: bool, 2025-05-07T20:32:40.6686754Z ) -> None: 2025-05-07T20:32:40.6686966Z torch.manual_seed(2025) 2025-05-07T20:32:40.6687209Z 2025-05-07T20:32:40.6687482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6687821Z 2025-05-07T20:32:40.6688015Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6688303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6688611Z x = x_sign * x_clamp 2025-05-07T20:32:40.6688844Z x0 = x[:, :D] 2025-05-07T20:32:40.6689061Z x1 = x[:, D:] 2025-05-07T20:32:40.6689268Z 2025-05-07T20:32:40.6689447Z if contiguous: 2025-05-07T20:32:40.6689678Z x0 = x0.contiguous() 2025-05-07T20:32:40.6689943Z x1 = x1.contiguous() 2025-05-07T20:32:40.6690176Z 2025-05-07T20:32:40.6690368Z if scale_ub is not None: 2025-05-07T20:32:40.6690649Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6690978Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6691291Z ) 2025-05-07T20:32:40.6691486Z else: 2025-05-07T20:32:40.6691689Z scale_ub_tensor = None 2025-05-07T20:32:40.6691943Z 2025-05-07T20:32:40.6692176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6692578Z op = silu_mul_quant 2025-05-07T20:32:40.6692827Z if compiled: 2025-05-07T20:32:40.6693076Z op = torch.compile(op) 2025-05-07T20:32:40.6693375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6693727Z 2025-05-07T20:32:40.6693919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6694083Z 2025-05-07T20:32:40.6694188Z moe/activation_test.py:117: 2025-05-07T20:32:40.6694480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6694820Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6695179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6695734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6696307Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6696979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6697684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6698225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6698923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6699592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6700124Z kernel = self.compile( 2025-05-07T20:32:40.6700722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6701385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6701787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6702019Z 2025-05-07T20:32:40.6702225Z self = 2025-05-07T20:32:40.6703330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6705119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc81370d0>} 2025-05-07T20:32:40.6706827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6707994Z context = 2025-05-07T20:32:40.6708284Z 2025-05-07T20:32:40.6708455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6708987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6709470Z module_map=module_map) 2025-05-07T20:32:40.6709886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6710247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6710518Z E ^ 2025-05-07T20:32:40.6710987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6711439Z 2025-05-07T20:32:40.6711860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6712380Z 2025-05-07T20:32:40.6712484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6712898Z self=, 2025-05-07T20:32:40.6713313Z T=16384, 2025-05-07T20:32:40.6713499Z D=7168, 2025-05-07T20:32:40.6713700Z scale_ub=1200.0, 2025-05-07T20:32:40.6714003Z contiguous=True, 2025-05-07T20:32:40.6714223Z compiled=True, 2025-05-07T20:32:40.6714433Z ) 2025-05-07T20:32:40.6714753Z self = 2025-05-07T20:32:40.6715432Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6715785Z 2025-05-07T20:32:40.6715881Z @given( 2025-05-07T20:32:40.6716171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6716553Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6716934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6717365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6717699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6717982Z ) 2025-05-07T20:32:40.6718332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6718776Z def test_silu_mul_quant( 2025-05-07T20:32:40.6719022Z self, 2025-05-07T20:32:40.6719221Z T: int, 2025-05-07T20:32:40.6719423Z D: int, 2025-05-07T20:32:40.6719638Z scale_ub: Optional[float], 2025-05-07T20:32:40.6719909Z contiguous: bool, 2025-05-07T20:32:40.6720158Z compiled: bool, 2025-05-07T20:32:40.6720382Z ) -> None: 2025-05-07T20:32:40.6720609Z torch.manual_seed(2025) 2025-05-07T20:32:40.6720852Z 2025-05-07T20:32:40.6721123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6721477Z 2025-05-07T20:32:40.6721676Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6722033Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6722352Z x = x_sign * x_clamp 2025-05-07T20:32:40.6722605Z x0 = x[:, :D] 2025-05-07T20:32:40.6722830Z x1 = x[:, D:] 2025-05-07T20:32:40.6723041Z 2025-05-07T20:32:40.6723233Z if contiguous: 2025-05-07T20:32:40.6723476Z x0 = x0.contiguous() 2025-05-07T20:32:40.6723740Z x1 = x1.contiguous() 2025-05-07T20:32:40.6723984Z 2025-05-07T20:32:40.6724180Z if scale_ub is not None: 2025-05-07T20:32:40.6724455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6724801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6725124Z ) 2025-05-07T20:32:40.6725314Z else: 2025-05-07T20:32:40.6725534Z scale_ub_tensor = None 2025-05-07T20:32:40.6725789Z 2025-05-07T20:32:40.6726016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6726336Z op = silu_mul_quant 2025-05-07T20:32:40.6726600Z if compiled: 2025-05-07T20:32:40.6726847Z op = torch.compile(op) 2025-05-07T20:32:40.6727151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6727431Z 2025-05-07T20:32:40.6727619Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6727794Z 2025-05-07T20:32:40.6727895Z moe/activation_test.py:117: 2025-05-07T20:32:40.6728197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6728537Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6728827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6729389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6729952Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6730616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6731314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6731847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6732530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6733192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6733775Z kernel = self.compile( 2025-05-07T20:32:40.6734316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6735045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6735494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6735722Z 2025-05-07T20:32:40.6735932Z self = 2025-05-07T20:32:40.6737026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6738461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8137d30>} 2025-05-07T20:32:40.6739823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6740861Z context = 2025-05-07T20:32:40.6741153Z 2025-05-07T20:32:40.6741318Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6741856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6742376Z module_map=module_map) 2025-05-07T20:32:40.6742738Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6743096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6743353Z E ^ 2025-05-07T20:32:40.6743822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6744278Z 2025-05-07T20:32:40.6744694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6745212Z 2025-05-07T20:32:40.9495372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9496054Z self=, 2025-05-07T20:32:40.9496611Z T=16384, 2025-05-07T20:32:40.9496834Z D=5120, 2025-05-07T20:32:40.9497029Z scale_ub=1200.0, 2025-05-07T20:32:40.9497254Z contiguous=True, 2025-05-07T20:32:40.9497473Z compiled=False, 2025-05-07T20:32:40.9497692Z ) 2025-05-07T20:32:40.9498011Z self = 2025-05-07T20:32:40.9498502Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.9498786Z 2025-05-07T20:32:40.9498871Z @given( 2025-05-07T20:32:40.9499100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9499415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9499722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9500051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9500386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9500673Z ) 2025-05-07T20:32:40.9501020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9501464Z def test_silu_mul_quant( 2025-05-07T20:32:40.9501708Z self, 2025-05-07T20:32:40.9501899Z T: int, 2025-05-07T20:32:40.9502100Z D: int, 2025-05-07T20:32:40.9502314Z scale_ub: Optional[float], 2025-05-07T20:32:40.9502587Z contiguous: bool, 2025-05-07T20:32:40.9502823Z compiled: bool, 2025-05-07T20:32:40.9503047Z ) -> None: 2025-05-07T20:32:40.9503262Z torch.manual_seed(2025) 2025-05-07T20:32:40.9503510Z 2025-05-07T20:32:40.9504169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9504737Z 2025-05-07T20:32:40.9504930Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9505216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9505552Z x = x_sign * x_clamp 2025-05-07T20:32:40.9505895Z x0 = x[:, :D] 2025-05-07T20:32:40.9506114Z x1 = x[:, D:] 2025-05-07T20:32:40.9506312Z 2025-05-07T20:32:40.9506498Z if contiguous: 2025-05-07T20:32:40.9506733Z x0 = x0.contiguous() 2025-05-07T20:32:40.9506988Z x1 = x1.contiguous() 2025-05-07T20:32:40.9507228Z 2025-05-07T20:32:40.9507522Z if scale_ub is not None: 2025-05-07T20:32:40.9507789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9508129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9508438Z ) 2025-05-07T20:32:40.9508618Z else: 2025-05-07T20:32:40.9508830Z scale_ub_tensor = None 2025-05-07T20:32:40.9509080Z 2025-05-07T20:32:40.9509309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9509625Z op = silu_mul_quant 2025-05-07T20:32:40.9509952Z if compiled: 2025-05-07T20:32:40.9510193Z op = torch.compile(op) 2025-05-07T20:32:40.9510496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9510772Z 2025-05-07T20:32:40.9510965Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9511127Z 2025-05-07T20:32:40.9511226Z moe/activation_test.py:117: 2025-05-07T20:32:40.9511521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9511930Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9512211Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9513078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9513776Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9514310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9514989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9515653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9516179Z kernel = self.compile( 2025-05-07T20:32:40.9516712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9517361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9517758Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9517982Z 2025-05-07T20:32:40.9518199Z self = 2025-05-07T20:32:40.9519272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9520668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc831d700>} 2025-05-07T20:32:40.9522010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9523032Z context = 2025-05-07T20:32:40.9523319Z 2025-05-07T20:32:40.9523491Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9524010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9524485Z module_map=module_map) 2025-05-07T20:32:40.9524909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9525282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9525565Z E ^ 2025-05-07T20:32:40.9526446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9526897Z 2025-05-07T20:32:40.9527320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9527829Z 2025-05-07T20:32:40.9527931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9528393Z self=, 2025-05-07T20:32:40.9528796Z T=1, 2025-05-07T20:32:40.9528977Z D=7168, 2025-05-07T20:32:40.9529173Z scale_ub=1200.0, 2025-05-07T20:32:40.9529400Z contiguous=False, 2025-05-07T20:32:40.9529622Z compiled=False, 2025-05-07T20:32:40.9529827Z ) 2025-05-07T20:32:40.9530147Z self = 2025-05-07T20:32:40.9530630Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.9530902Z 2025-05-07T20:32:40.9530974Z @given( 2025-05-07T20:32:40.9531208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.9531516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.9531816Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.9532152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.9532480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.9532814Z ) 2025-05-07T20:32:40.9533161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.9533599Z def test_silu_mul_quant( 2025-05-07T20:32:40.9533831Z self, 2025-05-07T20:32:40.9534022Z T: int, 2025-05-07T20:32:40.9534220Z D: int, 2025-05-07T20:32:40.9534432Z scale_ub: Optional[float], 2025-05-07T20:32:40.9534707Z contiguous: bool, 2025-05-07T20:32:40.9534943Z compiled: bool, 2025-05-07T20:32:40.9535168Z ) -> None: 2025-05-07T20:32:40.9535377Z torch.manual_seed(2025) 2025-05-07T20:32:40.9535646Z 2025-05-07T20:32:40.9535950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.9536283Z 2025-05-07T20:32:40.9536475Z x_sign = torch.sign(x) 2025-05-07T20:32:40.9536765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.9537072Z x = x_sign * x_clamp 2025-05-07T20:32:40.9537318Z x0 = x[:, :D] 2025-05-07T20:32:40.9537535Z x1 = x[:, D:] 2025-05-07T20:32:40.9537733Z 2025-05-07T20:32:40.9537917Z if contiguous: 2025-05-07T20:32:40.9538146Z x0 = x0.contiguous() 2025-05-07T20:32:40.9538400Z x1 = x1.contiguous() 2025-05-07T20:32:40.9538642Z 2025-05-07T20:32:40.9538835Z if scale_ub is not None: 2025-05-07T20:32:40.9539109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.9539448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.9539762Z ) 2025-05-07T20:32:40.9539957Z else: 2025-05-07T20:32:40.9540167Z scale_ub_tensor = None 2025-05-07T20:32:40.9540418Z 2025-05-07T20:32:40.9540648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.9540957Z op = silu_mul_quant 2025-05-07T20:32:40.9541210Z if compiled: 2025-05-07T20:32:40.9541460Z op = torch.compile(op) 2025-05-07T20:32:40.9541755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9542035Z 2025-05-07T20:32:40.9542227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.9542389Z 2025-05-07T20:32:40.9542488Z moe/activation_test.py:117: 2025-05-07T20:32:40.9542784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9543115Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.9543460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.9544140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.9544868Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.9545412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.9546091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.9546748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.9547319Z kernel = self.compile( 2025-05-07T20:32:40.9547857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.9548500Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9548898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.9549128Z 2025-05-07T20:32:40.9549336Z self = 2025-05-07T20:32:40.9550480Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.9551899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc82540d0>} 2025-05-07T20:32:40.9553259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.9554287Z context = 2025-05-07T20:32:40.9554577Z 2025-05-07T20:32:40.9554754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.9555279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9555741Z module_map=module_map) 2025-05-07T20:32:40.9563328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9563736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9564011Z E ^ 2025-05-07T20:32:40.9564493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9564967Z 2025-05-07T20:32:40.9565404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9565932Z 2025-05-07T20:32:40.9566039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.9566468Z self=, 2025-05-07T20:32:40.9566879Z T=4096, 2025-05-07T20:32:40.9567075Z D=7168, 2025-05-07T20:32:40.9567277Z scale_ub=1200.0, 2025-05-07T20:32:40.9567505Z contiguous=False, 2025-05-07T20:32:40.9567738Z compiled=True, 2025-05-07T20:32:40.9567956Z ) 2025-05-07T20:32:41.0731313Z self = 2025-05-07T20:32:41.0732093Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.0732469Z 2025-05-07T20:32:41.0732573Z @given( 2025-05-07T20:32:41.0732909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0733333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0733737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0734160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0734543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0734834Z ) 2025-05-07T20:32:41.0735431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0735886Z def test_silu_mul_quant( 2025-05-07T20:32:41.0736135Z self, 2025-05-07T20:32:41.0736329Z T: int, 2025-05-07T20:32:41.0736532Z D: int, 2025-05-07T20:32:41.0736847Z scale_ub: Optional[float], 2025-05-07T20:32:41.0737119Z contiguous: bool, 2025-05-07T20:32:41.0737368Z compiled: bool, 2025-05-07T20:32:41.0737601Z ) -> None: 2025-05-07T20:32:41.0737819Z torch.manual_seed(2025) 2025-05-07T20:32:41.0738078Z 2025-05-07T20:32:41.0738440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0738790Z 2025-05-07T20:32:41.0738980Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0739277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0739592Z x = x_sign * x_clamp 2025-05-07T20:32:41.0739827Z x0 = x[:, :D] 2025-05-07T20:32:41.0740052Z x1 = x[:, D:] 2025-05-07T20:32:41.0740270Z 2025-05-07T20:32:41.0740454Z if contiguous: 2025-05-07T20:32:41.0740696Z x0 = x0.contiguous() 2025-05-07T20:32:41.0740964Z x1 = x1.contiguous() 2025-05-07T20:32:41.0741206Z 2025-05-07T20:32:41.0741412Z if scale_ub is not None: 2025-05-07T20:32:41.0741698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0742039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0742352Z ) 2025-05-07T20:32:41.0742554Z else: 2025-05-07T20:32:41.0742848Z scale_ub_tensor = None 2025-05-07T20:32:41.0743115Z 2025-05-07T20:32:41.0743356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0743681Z op = silu_mul_quant 2025-05-07T20:32:41.0743931Z if compiled: 2025-05-07T20:32:41.0744188Z op = torch.compile(op) 2025-05-07T20:32:41.0744488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0744759Z 2025-05-07T20:32:41.0744958Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0745126Z 2025-05-07T20:32:41.0745240Z moe/activation_test.py:117: 2025-05-07T20:32:41.0745550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0745885Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0746187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0746758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0747316Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0747996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0748694Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0749235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0750002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0750663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0751195Z kernel = self.compile( 2025-05-07T20:32:41.0751733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0752391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0752789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0753025Z 2025-05-07T20:32:41.0753242Z self = 2025-05-07T20:32:41.0754324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0755783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8254dc0>} 2025-05-07T20:32:41.0757176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0758215Z context = 2025-05-07T20:32:41.0758501Z 2025-05-07T20:32:41.0758678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0759241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0759714Z module_map=module_map) 2025-05-07T20:32:41.0760086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0760439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0760709Z E ^ 2025-05-07T20:32:41.0761180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0761632Z 2025-05-07T20:32:41.0762063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0762577Z 2025-05-07T20:32:41.0762685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0763111Z self=, 2025-05-07T20:32:41.0763520Z T=128, 2025-05-07T20:32:41.0763748Z D=7168, 2025-05-07T20:32:41.0763949Z scale_ub=1200.0, 2025-05-07T20:32:41.0764182Z contiguous=False, 2025-05-07T20:32:41.0764406Z compiled=True, 2025-05-07T20:32:41.0764619Z ) 2025-05-07T20:32:41.0764947Z self = 2025-05-07T20:32:41.0765442Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.0765720Z 2025-05-07T20:32:41.0765797Z @given( 2025-05-07T20:32:41.0766034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0766351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0766662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0767004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0767334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0767618Z ) 2025-05-07T20:32:41.0767977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0768424Z def test_silu_mul_quant( 2025-05-07T20:32:41.0768662Z self, 2025-05-07T20:32:41.0768866Z T: int, 2025-05-07T20:32:41.0769072Z D: int, 2025-05-07T20:32:41.0769308Z scale_ub: Optional[float], 2025-05-07T20:32:41.0769584Z contiguous: bool, 2025-05-07T20:32:41.0769833Z compiled: bool, 2025-05-07T20:32:41.0770072Z ) -> None: 2025-05-07T20:32:41.0770291Z torch.manual_seed(2025) 2025-05-07T20:32:41.0770544Z 2025-05-07T20:32:41.0770824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0771168Z 2025-05-07T20:32:41.0771379Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0771680Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0771993Z x = x_sign * x_clamp 2025-05-07T20:32:41.0772245Z x0 = x[:, :D] 2025-05-07T20:32:41.0772475Z x1 = x[:, D:] 2025-05-07T20:32:41.0772690Z 2025-05-07T20:32:41.0772892Z if contiguous: 2025-05-07T20:32:41.0773132Z x0 = x0.contiguous() 2025-05-07T20:32:41.0773398Z x1 = x1.contiguous() 2025-05-07T20:32:41.0773646Z 2025-05-07T20:32:41.0773847Z if scale_ub is not None: 2025-05-07T20:32:41.0774120Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0774464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0774828Z ) 2025-05-07T20:32:41.0775021Z else: 2025-05-07T20:32:41.0775231Z scale_ub_tensor = None 2025-05-07T20:32:41.0775489Z 2025-05-07T20:32:41.0775722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0776070Z op = silu_mul_quant 2025-05-07T20:32:41.0776323Z if compiled: 2025-05-07T20:32:41.0776575Z op = torch.compile(op) 2025-05-07T20:32:41.0776874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0777151Z 2025-05-07T20:32:41.0777352Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0777564Z 2025-05-07T20:32:41.0777666Z moe/activation_test.py:117: 2025-05-07T20:32:41.0777967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0778299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0778585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0779136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0779695Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the previous example]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[same CompilationError: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5')]
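Every CompilationError in this run has the same root cause: fp8e4nv is Triton's name for the float8_e4m3fn dtype, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward. This job runs on linux.g5.4xlarge, whose A10G reports capability (8, 6), so Triton can offer only fp8e5 and fp8e4b15, and the kernel fails to compile for every example regardless of T, D, or the other parameters. A minimal guard along these lines (names hypothetical, not part of the test suite) would skip the test on such hardware instead of failing it:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9;
    # the A10G behind linux.g5.4xlarge reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the failing test:
# @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...): ...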
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError
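Note how the reported free memory shrinks across examples (140.44 MiB, then 28.44 MiB) while PyTorch's allocation stays around 21.6 GiB of the A10G's 22.07 GiB: tensors from earlier examples, or from earlier tests in the same process, are still holding the allocator's memory when the next example starts. One common mitigation is to release cached blocks between examples; this is a sketch under that assumption, not something the suite is known to do:

import gc

import torch

def _release_cuda_memory() -> None:
    # Drop dead Python references first so their CUDA blocks become
    # reclaimable, then hand the cached blocks back to the driver.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

# Hypothetical hook: call between Hypothesis examples, e.g. from tearDown().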
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture; supported fp8 dtypes are ('fp8e4b15', 'fp8e5')]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
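The "Tried to allocate" sizes line up exactly with the bfloat16 input tensor the test creates, which confirms the failures happen on the very first allocations of each example rather than somewhere inside the kernel:

# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) costs 2 bytes per element:
def x_size_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(x_size_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
print(x_size_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
print(x_size_mib(2048, 7168))   # 56.0  -> "Tried to allocate 56.00 MiB"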
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same CompilationError: type fp8e4nv not supported in this architecture]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError
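The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. With only 19 to 141 MiB reported as "reserved but unallocated" here, fragmentation is unlikely to be the main problem, but the setting is cheap to try. It must be in place before the first CUDA allocation, so it belongs in the job environment or at the very top of the test entry point (a sketch, assuming process-level control):

import os

# Must be set before the first CUDA allocation, i.e. before torch touches
# the GPU; exporting it in the CI job environment achieves the same thing.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402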
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8427124Z 2025-05-07T20:32:41.8427245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:41.8427463Z 2025-05-07T20:32:41.8427581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8428007Z self=, 2025-05-07T20:32:41.8428420Z T=16384, 2025-05-07T20:32:41.8428617Z D=5120, 2025-05-07T20:32:41.8428811Z scale_ub=None, 2025-05-07T20:32:41.8429020Z contiguous=True, 2025-05-07T20:32:41.8429259Z compiled=False, 2025-05-07T20:32:41.8429472Z ) 2025-05-07T20:32:41.8429787Z self = 2025-05-07T20:32:41.8430351Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.8430626Z 2025-05-07T20:32:41.8430708Z @given( 2025-05-07T20:32:41.8430931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.8431250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.8431554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.8431878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.8432209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.8432494Z ) 2025-05-07T20:32:41.8432845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.8433280Z def test_silu_mul_quant( 2025-05-07T20:32:41.8433525Z self, 2025-05-07T20:32:41.8433727Z T: int, 2025-05-07T20:32:41.8433924Z D: int, 2025-05-07T20:32:41.8434140Z scale_ub: Optional[float], 2025-05-07T20:32:41.8434421Z contiguous: bool, 2025-05-07T20:32:41.8434658Z compiled: bool, 2025-05-07T20:32:41.8434885Z ) -> None: 2025-05-07T20:32:41.8435101Z torch.manual_seed(2025) 2025-05-07T20:32:41.8435348Z 2025-05-07T20:32:41.8435671Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.8437755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.8439662Z 2025-05-07T20:32:41.8439780Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.8439992Z 2025-05-07T20:32:41.8440103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8440511Z self=, 2025-05-07T20:32:41.8440918Z T=4096, 2025-05-07T20:32:41.8441120Z D=5120, 2025-05-07T20:32:41.8441309Z scale_ub=None, 2025-05-07T20:32:41.8441529Z contiguous=True, 2025-05-07T20:32:41.8441754Z compiled=False, 2025-05-07T20:32:41.8441958Z ) 2025-05-07T20:32:41.9458776Z self = 2025-05-07T20:32:41.9459549Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:41.9459934Z 2025-05-07T20:32:41.9460036Z @given( 2025-05-07T20:32:41.9460281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9460745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9461077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9461423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9461762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9462056Z ) 2025-05-07T20:32:41.9462418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9462877Z def test_silu_mul_quant( 2025-05-07T20:32:41.9463124Z self, 2025-05-07T20:32:41.9463328Z T: int, 2025-05-07T20:32:41.9463535Z D: int, 2025-05-07T20:32:41.9463756Z scale_ub: Optional[float], 2025-05-07T20:32:41.9464042Z contiguous: bool, 2025-05-07T20:32:41.9464291Z compiled: bool, 2025-05-07T20:32:41.9464523Z ) -> None: 2025-05-07T20:32:41.9464750Z torch.manual_seed(2025) 2025-05-07T20:32:41.9465001Z 2025-05-07T20:32:41.9465282Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9467430Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9469331Z 2025-05-07T20:32:41.9469457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9469686Z 2025-05-07T20:32:41.9469793Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9470312Z self=, 2025-05-07T20:32:41.9470714Z T=2048, 2025-05-07T20:32:41.9470911Z D=5120, 2025-05-07T20:32:41.9471107Z scale_ub=None, 2025-05-07T20:32:41.9471328Z contiguous=False, 2025-05-07T20:32:41.9471563Z compiled=False, 2025-05-07T20:32:41.9471777Z ) 2025-05-07T20:32:41.9472098Z self = 2025-05-07T20:32:41.9472597Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.9472879Z 2025-05-07T20:32:41.9473039Z @given( 2025-05-07T20:32:41.9473275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9473587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9473906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9474314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9474649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9474949Z ) 2025-05-07T20:32:41.9475307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9475772Z def test_silu_mul_quant( 2025-05-07T20:32:41.9476124Z self, 2025-05-07T20:32:41.9476325Z T: int, 2025-05-07T20:32:41.9476530Z D: int, 2025-05-07T20:32:41.9476749Z scale_ub: Optional[float], 2025-05-07T20:32:41.9477031Z contiguous: bool, 2025-05-07T20:32:41.9477280Z compiled: bool, 2025-05-07T20:32:41.9477505Z ) -> None: 2025-05-07T20:32:41.9477731Z torch.manual_seed(2025) 2025-05-07T20:32:41.9477991Z 2025-05-07T20:32:41.9478262Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9480416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9482305Z 2025-05-07T20:32:41.9482428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9482648Z 2025-05-07T20:32:41.9482755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9483185Z self=, 2025-05-07T20:32:41.9483591Z T=4096, 2025-05-07T20:32:41.9483792Z D=7168, 2025-05-07T20:32:41.9484000Z scale_ub=None, 2025-05-07T20:32:41.9484216Z contiguous=True, 2025-05-07T20:32:41.9484447Z compiled=True, 2025-05-07T20:32:41.9484659Z ) 2025-05-07T20:32:41.9484984Z self = 2025-05-07T20:32:41.9485479Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9485747Z 2025-05-07T20:32:41.9485834Z @given( 2025-05-07T20:32:41.9486088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9486410Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9486728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9487060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9487406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9487703Z ) 2025-05-07T20:32:41.9488058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9488507Z def test_silu_mul_quant( 2025-05-07T20:32:41.9488760Z self, 2025-05-07T20:32:41.9488963Z T: int, 2025-05-07T20:32:41.9489174Z D: int, 2025-05-07T20:32:41.9489411Z scale_ub: Optional[float], 2025-05-07T20:32:41.9489684Z contiguous: bool, 2025-05-07T20:32:41.9489935Z compiled: bool, 2025-05-07T20:32:41.9490179Z ) -> None: 2025-05-07T20:32:41.9490401Z torch.manual_seed(2025) 2025-05-07T20:32:41.9490657Z 2025-05-07T20:32:41.9490943Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9493020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9495007Z 2025-05-07T20:32:41.9495136Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9495351Z 2025-05-07T20:32:41.9495456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9495881Z self=, 2025-05-07T20:32:41.9496288Z T=2048, 2025-05-07T20:32:41.9496524Z D=5120, 2025-05-07T20:32:41.9496722Z scale_ub=1200.0, 2025-05-07T20:32:41.9496954Z contiguous=False, 2025-05-07T20:32:41.9497184Z compiled=False, 2025-05-07T20:32:41.9497397Z ) 2025-05-07T20:32:41.9497719Z self = 2025-05-07T20:32:41.9498218Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.9498503Z 2025-05-07T20:32:41.9498585Z @given( 2025-05-07T20:32:41.9498823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9499144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9499458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9499803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9500145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9500434Z ) 2025-05-07T20:32:41.9500789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9501290Z def test_silu_mul_quant( 2025-05-07T20:32:41.9501551Z self, 2025-05-07T20:32:41.9501746Z T: int, 2025-05-07T20:32:41.9501950Z D: int, 2025-05-07T20:32:41.9502180Z scale_ub: Optional[float], 2025-05-07T20:32:41.9502458Z contiguous: bool, 2025-05-07T20:32:41.9502711Z compiled: bool, 2025-05-07T20:32:41.9502947Z ) -> None: 2025-05-07T20:32:41.9503169Z torch.manual_seed(2025) 2025-05-07T20:32:41.9503429Z 2025-05-07T20:32:41.9503971Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9506115Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9507975Z 2025-05-07T20:32:41.9508098Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9508324Z 2025-05-07T20:32:41.9508431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9508862Z self=, 2025-05-07T20:32:41.9509282Z T=4096, 2025-05-07T20:32:41.9509473Z D=7168, 2025-05-07T20:32:41.9509675Z scale_ub=1200.0, 2025-05-07T20:32:41.9509971Z contiguous=True, 2025-05-07T20:32:41.9510198Z compiled=False, 2025-05-07T20:32:41.9510412Z ) 2025-05-07T20:32:41.9510734Z self = 2025-05-07T20:32:41.9511224Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.9511510Z 2025-05-07T20:32:41.9511595Z @given( 2025-05-07T20:32:41.9511834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9512150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9512472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9512815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9513155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9513520Z ) 2025-05-07T20:32:41.9513875Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9514331Z def test_silu_mul_quant( 2025-05-07T20:32:41.9514577Z self, 2025-05-07T20:32:41.9514849Z T: int, 2025-05-07T20:32:41.9515063Z D: int, 2025-05-07T20:32:41.9515283Z scale_ub: Optional[float], 2025-05-07T20:32:41.9515564Z contiguous: bool, 2025-05-07T20:32:41.9515811Z compiled: bool, 2025-05-07T20:32:41.9516037Z ) -> None: 2025-05-07T20:32:41.9516265Z torch.manual_seed(2025) 2025-05-07T20:32:41.9516580Z 2025-05-07T20:32:41.9516856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9518936Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:41.9520834Z 2025-05-07T20:32:41.9520956Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:41.9521175Z 2025-05-07T20:32:41.9521283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9521761Z self=, 2025-05-07T20:32:41.9522170Z T=16384, 2025-05-07T20:32:41.9522375Z D=7168, 2025-05-07T20:32:41.9522578Z scale_ub=None, 2025-05-07T20:32:41.9522794Z contiguous=False, 2025-05-07T20:32:41.9523034Z compiled=True, 2025-05-07T20:32:41.9523247Z ) 2025-05-07T20:32:42.0820331Z self = 2025-05-07T20:32:42.0821104Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.0821490Z 2025-05-07T20:32:42.0821583Z @given( 2025-05-07T20:32:42.0821819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0822150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0822464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0822796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0823138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0823432Z ) 2025-05-07T20:32:42.0823793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0824246Z def test_silu_mul_quant( 2025-05-07T20:32:42.0824497Z self, 2025-05-07T20:32:42.0824695Z T: int, 2025-05-07T20:32:42.0824903Z D: int, 2025-05-07T20:32:42.0825133Z scale_ub: Optional[float], 2025-05-07T20:32:42.0825410Z contiguous: bool, 2025-05-07T20:32:42.0825711Z compiled: bool, 2025-05-07T20:32:42.0825948Z ) -> None: 2025-05-07T20:32:42.0826173Z torch.manual_seed(2025) 2025-05-07T20:32:42.0826419Z 2025-05-07T20:32:42.0826702Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0828784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
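The failed allocation sizes line up exactly with the first tensor the test creates: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick check against the examples above:

    >>> def alloc_mib(T: int, D: int) -> float:
    ...     return T * (2 * D) * 2 / 2**20  # rows * cols * bytes per bf16 element
    ...
    >>> alloc_mib(4096, 7168), alloc_mib(2048, 5120), alloc_mib(16384, 7168)
    (112.0, 40.0, 448.0)

So each of these examples dies on its very first line of GPU work (moe/activation_test.py:92), consistent with the device already being at capacity when the example starts.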
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0830757Z 2025-05-07T20:32:42.0830885Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0831323Z 2025-05-07T20:32:42.0831430Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0831852Z self=, 2025-05-07T20:32:42.0832265Z T=4096, 2025-05-07T20:32:42.0832463Z D=7168, 2025-05-07T20:32:42.0832729Z scale_ub=None, 2025-05-07T20:32:42.0832955Z contiguous=True, 2025-05-07T20:32:42.0833187Z compiled=False, 2025-05-07T20:32:42.0833401Z ) 2025-05-07T20:32:42.0833721Z self = 2025-05-07T20:32:42.0834220Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0834572Z 2025-05-07T20:32:42.0834653Z @given( 2025-05-07T20:32:42.0834889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0835209Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0835522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0835857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0836205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0836501Z ) 2025-05-07T20:32:42.0836850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0837300Z def test_silu_mul_quant( 2025-05-07T20:32:42.0837552Z self, 2025-05-07T20:32:42.0837751Z T: int, 2025-05-07T20:32:42.0837957Z D: int, 2025-05-07T20:32:42.0838188Z scale_ub: Optional[float], 2025-05-07T20:32:42.0838463Z contiguous: bool, 2025-05-07T20:32:42.0838710Z compiled: bool, 2025-05-07T20:32:42.0839066Z ) -> None: 2025-05-07T20:32:42.0839286Z torch.manual_seed(2025) 2025-05-07T20:32:42.0839542Z 2025-05-07T20:32:42.0839828Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0841914Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0843805Z 2025-05-07T20:32:42.0843935Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0844152Z 2025-05-07T20:32:42.0844258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0844689Z self=, 2025-05-07T20:32:42.0845106Z T=16384, 2025-05-07T20:32:42.0845305Z D=7168, 2025-05-07T20:32:42.0845512Z scale_ub=None, 2025-05-07T20:32:42.0845735Z contiguous=True, 2025-05-07T20:32:42.0845966Z compiled=False, 2025-05-07T20:32:42.0846183Z ) 2025-05-07T20:32:42.0846507Z self = 2025-05-07T20:32:42.0846999Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0847282Z 2025-05-07T20:32:42.0847364Z @given( 2025-05-07T20:32:42.0847603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0847921Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0848229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0848563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0848910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0849199Z ) 2025-05-07T20:32:42.0849555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0850005Z def test_silu_mul_quant( 2025-05-07T20:32:42.0850255Z self, 2025-05-07T20:32:42.0850460Z T: int, 2025-05-07T20:32:42.0850668Z D: int, 2025-05-07T20:32:42.0850946Z scale_ub: Optional[float], 2025-05-07T20:32:42.0851224Z contiguous: bool, 2025-05-07T20:32:42.0851474Z compiled: bool, 2025-05-07T20:32:42.0851704Z ) -> None: 2025-05-07T20:32:42.0851921Z torch.manual_seed(2025) 2025-05-07T20:32:42.0852174Z 2025-05-07T20:32:42.0852495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0854561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0856534Z 2025-05-07T20:32:42.0856658Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0856881Z 2025-05-07T20:32:42.0856987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0857409Z self=, 2025-05-07T20:32:42.0857820Z T=16384, 2025-05-07T20:32:42.0858015Z D=7168, 2025-05-07T20:32:42.0858214Z scale_ub=1200.0, 2025-05-07T20:32:42.0858450Z contiguous=True, 2025-05-07T20:32:42.0858673Z compiled=False, 2025-05-07T20:32:42.0858883Z ) 2025-05-07T20:32:42.0859207Z self = 2025-05-07T20:32:42.0859757Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.0860042Z 2025-05-07T20:32:42.0860121Z @given( 2025-05-07T20:32:42.0860362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0860691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0861010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0861356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0861701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0861990Z ) 2025-05-07T20:32:42.0870210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0870672Z def test_silu_mul_quant( 2025-05-07T20:32:42.0870918Z self, 2025-05-07T20:32:42.0871126Z T: int, 2025-05-07T20:32:42.0871331Z D: int, 2025-05-07T20:32:42.0871549Z scale_ub: Optional[float], 2025-05-07T20:32:42.0871832Z contiguous: bool, 2025-05-07T20:32:42.0872080Z compiled: bool, 2025-05-07T20:32:42.0872305Z ) -> None: 2025-05-07T20:32:42.0872526Z torch.manual_seed(2025) 2025-05-07T20:32:42.0872776Z 2025-05-07T20:32:42.0873049Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0875151Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
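For readers skimming the repeated test body: when compiled=True the test wraps the op in torch.compile and otherwise calls it eagerly, so both paths exercise the same underlying kernel. A self-contained sketch of that pattern, using a stand-in function rather than the FBGEMM op (silu_mul below is hypothetical, though its math matches the test's reference: x0 * sigmoid(x0) * x1):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 -- the gated activation this test quantizes.
        return x0 * torch.sigmoid(x0) * x1

    op = torch.compile(silu_mul)  # compilation is deferred to the first call
    y = op(torch.randn(8, 16), torch.randn(8, 16))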
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.0877058Z 2025-05-07T20:32:42.0877179Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.0877414Z 2025-05-07T20:32:42.0877518Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0877937Z self=, 2025-05-07T20:32:42.0878338Z T=128, 2025-05-07T20:32:42.0878532Z D=5120, 2025-05-07T20:32:42.0878731Z scale_ub=1200.0, 2025-05-07T20:32:42.0878954Z contiguous=False, 2025-05-07T20:32:42.0879273Z compiled=False, 2025-05-07T20:32:42.0879482Z ) 2025-05-07T20:32:42.2500042Z self = 2025-05-07T20:32:42.2500783Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.2501432Z 2025-05-07T20:32:42.2501520Z @given( 2025-05-07T20:32:42.2501766Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2502081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2502390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2502737Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2503172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2503454Z ) 2025-05-07T20:32:42.2504082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2504534Z def test_silu_mul_quant( 2025-05-07T20:32:42.2504773Z self, 2025-05-07T20:32:42.2504979Z T: int, 2025-05-07T20:32:42.2505183Z D: int, 2025-05-07T20:32:42.2505398Z scale_ub: Optional[float], 2025-05-07T20:32:42.2505676Z contiguous: bool, 2025-05-07T20:32:42.2505922Z compiled: bool, 2025-05-07T20:32:42.2506146Z ) -> None: 2025-05-07T20:32:42.2506371Z torch.manual_seed(2025) 2025-05-07T20:32:42.2506620Z 2025-05-07T20:32:42.2506888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2507237Z 2025-05-07T20:32:42.2507436Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2507821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2508133Z x = x_sign * x_clamp 2025-05-07T20:32:42.2508381Z x0 = x[:, :D] 2025-05-07T20:32:42.2508601Z x1 = x[:, D:] 2025-05-07T20:32:42.2508804Z 2025-05-07T20:32:42.2508994Z if contiguous: 2025-05-07T20:32:42.2509228Z x0 = x0.contiguous() 2025-05-07T20:32:42.2509485Z x1 = x1.contiguous() 2025-05-07T20:32:42.2509728Z 2025-05-07T20:32:42.2510010Z if scale_ub is not None: 2025-05-07T20:32:42.2510282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2510611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2510919Z ) 2025-05-07T20:32:42.2511115Z else: 2025-05-07T20:32:42.2511319Z scale_ub_tensor = None 2025-05-07T20:32:42.2511568Z 2025-05-07T20:32:42.2511799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2512108Z op = silu_mul_quant 2025-05-07T20:32:42.2512349Z if compiled: 2025-05-07T20:32:42.2512602Z op = torch.compile(op) 2025-05-07T20:32:42.2512900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2513169Z 2025-05-07T20:32:42.2513365Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2513528Z 2025-05-07T20:32:42.2513638Z moe/activation_test.py:117: 2025-05-07T20:32:42.2513924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2514430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2514711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2515402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2516092Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2516625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2517311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2517964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2518490Z kernel = self.compile( 2025-05-07T20:32:42.2519033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2519769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2520153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2520385Z 2025-05-07T20:32:42.2520646Z self = 2025-05-07T20:32:42.2521729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2523194Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7bc5ca0>} 2025-05-07T20:32:42.2524530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2525553Z context = 2025-05-07T20:32:42.2525848Z 2025-05-07T20:32:42.2526016Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2526542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2526999Z module_map=module_map) 2025-05-07T20:32:42.2527369Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2527718Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2527971Z E ^ 2025-05-07T20:32:42.2528478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2528937Z 2025-05-07T20:32:42.2529351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2529864Z 2025-05-07T20:32:42.2529981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2530386Z self=, 2025-05-07T20:32:42.2530783Z T=2048, 2025-05-07T20:32:42.2530974Z D=7168, 2025-05-07T20:32:42.2531158Z scale_ub=None, 2025-05-07T20:32:42.2531373Z contiguous=False, 2025-05-07T20:32:42.2531597Z compiled=False, 2025-05-07T20:32:42.2531808Z ) 2025-05-07T20:32:42.2532118Z self = 2025-05-07T20:32:42.2532608Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.2532884Z 2025-05-07T20:32:42.2532966Z @given( 2025-05-07T20:32:42.2533188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2533501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2533810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2534133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2534463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2534747Z ) 2025-05-07T20:32:42.2535096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2535534Z def test_silu_mul_quant( 2025-05-07T20:32:42.2535780Z self, 2025-05-07T20:32:42.2535982Z T: int, 2025-05-07T20:32:42.2536172Z D: int, 2025-05-07T20:32:42.2536388Z scale_ub: Optional[float], 2025-05-07T20:32:42.2536659Z contiguous: bool, 2025-05-07T20:32:42.2536889Z compiled: bool, 2025-05-07T20:32:42.2537109Z ) -> None: 2025-05-07T20:32:42.2537327Z torch.manual_seed(2025) 2025-05-07T20:32:42.2537563Z 2025-05-07T20:32:42.2537830Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2539930Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
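This second failure mode is unrelated to the OOMs: Triton's fp8e4nv type (the NVIDIA FP8 E4M3 format) is only available on GPUs with compute capability 8.9 or newer (Ada/Hopper). The job runs on a linux.g5.4xlarge.nvidia.gpu runner, whose A10G reports SM 8.6, so the kernel is rejected at compile time and only fp8e4b15 and fp8e5 remain. A hedged guard sketch (the helper name is hypothetical; the 8.9 threshold is an assumption inferred from the error text):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv requires SM 8.9+; an A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(
        supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU architecture"
    )

applied as a decorator, this would skip rather than fail the affected cases on pre-Ada runners.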
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2541824Z 2025-05-07T20:32:42.2541947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2542156Z 2025-05-07T20:32:42.2542263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2542669Z self=, 2025-05-07T20:32:42.2543107Z T=128, 2025-05-07T20:32:42.2543288Z D=7168, 2025-05-07T20:32:42.2543470Z scale_ub=1200.0, 2025-05-07T20:32:42.2543687Z contiguous=True, 2025-05-07T20:32:42.2543904Z compiled=True, 2025-05-07T20:32:42.2544099Z ) 2025-05-07T20:32:42.2990476Z self = 2025-05-07T20:32:42.2991250Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.2991625Z 2025-05-07T20:32:42.2991727Z @given( 2025-05-07T20:32:42.2992033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2992382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2992689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2993017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2993341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2993624Z ) 2025-05-07T20:32:42.2994143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2994586Z def test_silu_mul_quant( 2025-05-07T20:32:42.2994822Z self, 2025-05-07T20:32:42.2995020Z T: int, 2025-05-07T20:32:42.2995216Z D: int, 2025-05-07T20:32:42.2995432Z scale_ub: Optional[float], 2025-05-07T20:32:42.2995761Z contiguous: bool, 2025-05-07T20:32:42.2996003Z compiled: bool, 2025-05-07T20:32:42.2996227Z ) -> None: 2025-05-07T20:32:42.2996448Z torch.manual_seed(2025) 2025-05-07T20:32:42.2996691Z 2025-05-07T20:32:42.2996961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2997306Z 2025-05-07T20:32:42.2997508Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2997803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2998107Z x = x_sign * x_clamp 2025-05-07T20:32:42.2998354Z x0 = x[:, :D] 2025-05-07T20:32:42.2998579Z x1 = x[:, D:] 2025-05-07T20:32:42.2998778Z 2025-05-07T20:32:42.2998964Z if contiguous: 2025-05-07T20:32:42.2999194Z x0 = x0.contiguous() 2025-05-07T20:32:42.2999447Z x1 = x1.contiguous() 2025-05-07T20:32:42.2999687Z 2025-05-07T20:32:42.2999876Z if scale_ub is not None: 2025-05-07T20:32:42.3000142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.3000477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.3000784Z ) 2025-05-07T20:32:42.3000969Z else: 2025-05-07T20:32:42.3001176Z scale_ub_tensor = None 2025-05-07T20:32:42.3001425Z 2025-05-07T20:32:42.3001649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.3001964Z op = silu_mul_quant 2025-05-07T20:32:42.3002211Z if compiled: 2025-05-07T20:32:42.3002456Z op = torch.compile(op) 2025-05-07T20:32:42.3002756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3003031Z 2025-05-07T20:32:42.3003220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.3003384Z 2025-05-07T20:32:42.3003508Z moe/activation_test.py:117: 2025-05-07T20:32:42.3004078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3004412Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.3004811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.3005366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.3005978Z return fn(*args, **kwargs) 2025-05-07T20:32:42.3006714Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.3007404Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.3007933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.3008721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.3009384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.3009909Z kernel = self.compile( 2025-05-07T20:32:42.3010452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.3011108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.3011506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.3011732Z 2025-05-07T20:32:42.3011942Z self = 2025-05-07T20:32:42.3013091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.3014492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7b390d0>} 2025-05-07T20:32:42.3015842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.3016867Z context = 2025-05-07T20:32:42.3017150Z 2025-05-07T20:32:42.3017317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.3017837Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.3018304Z module_map=module_map) 2025-05-07T20:32:42.3018664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3019021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3019278Z E ^ 2025-05-07T20:32:42.3019740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.3020188Z 2025-05-07T20:32:42.3020602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.3021118Z 2025-05-07T20:32:42.3021219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3021627Z self=, 2025-05-07T20:32:42.3022029Z T=128, 2025-05-07T20:32:42.3022208Z D=7168, 2025-05-07T20:32:42.3022401Z scale_ub=1200.0, 2025-05-07T20:32:42.3022626Z contiguous=True, 2025-05-07T20:32:42.3022839Z compiled=False, 2025-05-07T20:32:42.3023047Z ) 2025-05-07T20:32:42.3023360Z self = 2025-05-07T20:32:42.3023841Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3024118Z 2025-05-07T20:32:42.3024194Z @given( 2025-05-07T20:32:42.3024419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3024721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3025026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3025406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3026062Z ) 2025-05-07T20:32:42.3026407Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3026886Z def test_silu_mul_quant( 2025-05-07T20:32:42.3027120Z self, 2025-05-07T20:32:42.3027311Z T: int, 2025-05-07T20:32:42.3027507Z D: int, 2025-05-07T20:32:42.3027716Z scale_ub: Optional[float], 2025-05-07T20:32:42.3027985Z contiguous: bool, 2025-05-07T20:32:42.3028217Z compiled: bool, 2025-05-07T20:32:42.3028479Z ) -> None: 2025-05-07T20:32:42.3028692Z torch.manual_seed(2025) 2025-05-07T20:32:42.3028928Z 2025-05-07T20:32:42.3029190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3029530Z 2025-05-07T20:32:42.3029717Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3030092Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3032095Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
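Note the trend across examples: free memory was 26.44 MiB in the earlier failures and is down to 4.44 MiB here, so allocations are accumulating within the single test process as Hypothesis iterates. One mitigation sketch, assuming no live references are intentionally kept between examples (this is not something the logged test currently does):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

Called at the start of the test body, this limits carry-over between generated examples, provided nothing else holds the tensors alive.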
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3034003Z 2025-05-07T20:32:42.3034121Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3034337Z 2025-05-07T20:32:42.3034436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3034849Z self=, 2025-05-07T20:32:42.3035247Z T=128, 2025-05-07T20:32:42.3035437Z D=5120, 2025-05-07T20:32:42.3035627Z scale_ub=1200.0, 2025-05-07T20:32:42.3035866Z contiguous=True, 2025-05-07T20:32:42.3036111Z compiled=True, 2025-05-07T20:32:42.3036312Z ) 2025-05-07T20:32:42.3036615Z self = 2025-05-07T20:32:42.3037099Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.3037369Z 2025-05-07T20:32:42.3037442Z @given( 2025-05-07T20:32:42.3037662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3037964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3038269Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3038596Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3038913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3039194Z ) 2025-05-07T20:32:42.3039544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3039975Z def test_silu_mul_quant( 2025-05-07T20:32:42.3040237Z self, 2025-05-07T20:32:42.3040429Z T: int, 2025-05-07T20:32:42.3040625Z D: int, 2025-05-07T20:32:42.3040831Z scale_ub: Optional[float], 2025-05-07T20:32:42.3041100Z contiguous: bool, 2025-05-07T20:32:42.3041337Z compiled: bool, 2025-05-07T20:32:42.3041550Z ) -> None: 2025-05-07T20:32:42.3041761Z torch.manual_seed(2025) 2025-05-07T20:32:42.3041998Z 2025-05-07T20:32:42.3042259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3042596Z 2025-05-07T20:32:42.3042792Z x_sign = torch.sign(x) 2025-05-07T20:32:42.3043072Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.3045109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3047014Z 2025-05-07T20:32:42.3047129Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.3047344Z 2025-05-07T20:32:42.3047443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3047863Z self=, 2025-05-07T20:32:42.3048311Z T=128, 2025-05-07T20:32:42.3048495Z D=7168, 2025-05-07T20:32:42.3048685Z scale_ub=None, 2025-05-07T20:32:42.3048890Z contiguous=True, 2025-05-07T20:32:42.3049115Z compiled=True, 2025-05-07T20:32:42.3049315Z ) 2025-05-07T20:32:42.5174942Z self = 2025-05-07T20:32:42.5175709Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5176074Z 2025-05-07T20:32:42.5176183Z @given( 2025-05-07T20:32:42.5176493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5176903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5177301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5177728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5178092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5178381Z ) 2025-05-07T20:32:42.5178965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5179426Z def test_silu_mul_quant( 2025-05-07T20:32:42.5179664Z self, 2025-05-07T20:32:42.5179862Z T: int, 2025-05-07T20:32:42.5180061Z D: int, 2025-05-07T20:32:42.5180275Z scale_ub: Optional[float], 2025-05-07T20:32:42.5180545Z contiguous: bool, 2025-05-07T20:32:42.5180789Z compiled: bool, 2025-05-07T20:32:42.5181011Z ) -> None: 2025-05-07T20:32:42.5181231Z torch.manual_seed(2025) 2025-05-07T20:32:42.5181478Z 2025-05-07T20:32:42.5181743Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5183826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
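The reference path shown further down in this log (ref_fn, which calls triton_quantize_fp8_row and dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]) implies a per-row scale chosen so each row fits the FP8 range. A pure-PyTorch sketch of that contract (hypothetical helper; the constant and clamping are assumptions, not FBGEMM's exact kernel):

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the dynamic range
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX  # one dequant scale per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does, y_fp8.to(torch.float32) * scale[:, None], recovers y up to FP8 rounding.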
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5185773Z 2025-05-07T20:32:42.5185902Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5186123Z 2025-05-07T20:32:42.5197123Z FAILED 2025-05-07T20:32:42.5197288Z 2025-05-07T20:32:42.5197638Z =================================== FAILURES =================================== 2025-05-07T20:32:42.5198240Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:42.5198848Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:42.5199689Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:32:42.5200425Z | yield 2025-05-07T20:32:42.5201001Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:32:42.5201716Z | self._callTestMethod(testMethod) 2025-05-07T20:32:42.5202497Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:32:42.5203219Z | method() 2025-05-07T20:32:42.5204302Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:42.5205493Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5206485Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:42.5207340Z | raise the_error_hypothesis_found 2025-05-07T20:32:42.5208003Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:42.5208667Z +-+---------------- 1 ---------------- 2025-05-07T20:32:42.5209163Z | Traceback (most recent call last): 2025-05-07T20:32:42.5210124Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5211192Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5214471Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
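Hypothesis aggregates the distinct falsifying examples into a single exception group; on Python 3.9, which predates PEP 654, it uses the exceptiongroup backport visible in the traceback above. Of the four sub-exceptions, three are the same CUDA OOM at activation_test.py:92 and one is the Triton fp8e4nv CompilationError, so there are really only two root causes. A small, purely illustrative sketch of unpacking such a group:

    from exceptiongroup import ExceptionGroup  # backport; built in from Python 3.11

    try:
        raise ExceptionGroup("demo", [ValueError("oom"), TypeError("compile")])
    except ExceptionGroup as eg:
        for sub in eg.exceptions:  # inspect each distinct failure
            print(type(sub).__name__, sub)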
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5230669Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5231393Z | self=, 2025-05-07T20:32:42.5231942Z | T=2048, 2025-05-07T20:32:42.5232260Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5232728Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5233201Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5233704Z | compiled=False, # or any other generated value 2025-05-07T20:32:42.5234127Z | ) 2025-05-07T20:32:42.5234370Z | 2025-05-07T20:32:42.5235076Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:42.5235917Z +---------------- 2 ---------------- 2025-05-07T20:32:42.5236316Z | Traceback (most recent call last): 2025-05-07T20:32:42.5237297Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5238360Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5241211Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5243386Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5243839Z | self=, 2025-05-07T20:32:42.5244246Z | T=128, 2025-05-07T20:32:42.5244451Z | D=7168, 2025-05-07T20:32:42.5244668Z | scale_ub=None, 2025-05-07T20:32:42.5244909Z | contiguous=True, 2025-05-07T20:32:42.5245145Z | compiled=True, 2025-05-07T20:32:42.5245368Z | ) 2025-05-07T20:32:42.5245551Z | 2025-05-07T20:32:42.5246121Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5246790Z +---------------- 3 ---------------- 2025-05-07T20:32:42.5247084Z | Traceback (most recent call last): 2025-05-07T20:32:42.5247861Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:42.5248643Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5250705Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
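Each falsifying example above comes with a replay blob; adding it as a decorator pins Hypothesis to exactly that example, which is the fastest way to reproduce one of these failures locally. A sketch using the first blob from this log (the blob is version-locked: it only replays under Hypothesis 6.131.14, and the strategies must stay identical to the original test):

    from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # copied verbatim from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
        ...  # original test body unchanged

As the message says, the decorator is temporary: remove it once the underlying failure is fixed.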
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5252739Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5253190Z | self=, 2025-05-07T20:32:42.5253592Z | T=128, 2025-05-07T20:32:42.5253802Z | D=5120, 2025-05-07T20:32:42.5254018Z | scale_ub=1200.0, 2025-05-07T20:32:42.5254256Z | contiguous=True, 2025-05-07T20:32:42.5254500Z | compiled=True, 2025-05-07T20:32:42.5254729Z | ) 2025-05-07T20:32:42.5254904Z | 2025-05-07T20:32:42.5255472Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5256126Z +---------------- 4 ---------------- 2025-05-07T20:32:42.5256430Z | Traceback (most recent call last): 2025-05-07T20:32:42.5257134Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:42.5257855Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5258513Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:42.5259213Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5260042Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:42.5260843Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5261463Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:42.5262198Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5263145Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:42.5264211Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5265280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:42.5266387Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5267469Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:42.5268441Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5269346Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:42.5270240Z | fn() 2025-05-07T20:32:42.5271012Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:42.5271954Z | self.fn.run( 2025-05-07T20:32:42.5272727Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:42.5273517Z | kernel = self.compile( 2025-05-07T20:32:42.5274343Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:42.5275321Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5276382Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.5277461Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5278173Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5278659Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5279011Z | ^ 2025-05-07T20:32:42.5279649Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5280437Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:42.5280997Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:42.5281700Z | self=, 2025-05-07T20:32:42.5282298Z | T=1, # or any other generated value 2025-05-07T20:32:42.5282785Z | D=5120, # or any other generated value 2025-05-07T20:32:42.5283237Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:42.5283726Z | contiguous=True, # or any other generated value 2025-05-07T20:32:42.5284219Z | compiled=True, # or any other generated value 2025-05-07T20:32:42.5284621Z | ) 2025-05-07T20:32:42.5284859Z | 2025-05-07T20:32:42.5285586Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:42.5286444Z +------------------------------------ 2025-05-07T20:32:42.5286930Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:42.5287450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5288016Z self=, 2025-05-07T20:32:42.5288554Z T=1, 2025-05-07T20:32:42.5288799Z D=5120, 2025-05-07T20:32:42.5289055Z scale_ub=None, 2025-05-07T20:32:42.5289333Z contiguous=True, 2025-05-07T20:32:42.5289634Z compiled=True, 2025-05-07T20:32:42.5289917Z ) 2025-05-07T20:32:42.5290352Z self = 2025-05-07T20:32:42.5291012Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5291380Z 2025-05-07T20:32:42.5291485Z @given( 2025-05-07T20:32:42.5291800Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5292220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5292642Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5293101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5293550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5293957Z ) 2025-05-07T20:32:42.5294435Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5295035Z def test_silu_mul_quant( 2025-05-07T20:32:42.5295362Z self, 2025-05-07T20:32:42.5295621Z T: int, 2025-05-07T20:32:42.5295905Z D: int, 2025-05-07T20:32:42.5296230Z scale_ub: Optional[float], 2025-05-07T20:32:42.5296596Z contiguous: bool, 2025-05-07T20:32:42.5296922Z compiled: bool, 2025-05-07T20:32:42.5297218Z ) -> None: 2025-05-07T20:32:42.5297572Z torch.manual_seed(2025) 2025-05-07T20:32:42.5297903Z 2025-05-07T20:32:42.5298262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5298739Z 2025-05-07T20:32:42.5299003Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5299433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5299857Z x = x_sign * x_clamp 2025-05-07T20:32:42.5300179Z x0 = x[:, :D] 2025-05-07T20:32:42.5300462Z x1 = x[:, D:] 2025-05-07T20:32:42.5300744Z 2025-05-07T20:32:42.5300995Z if contiguous: 2025-05-07T20:32:42.5301297Z x0 = x0.contiguous() 
2025-05-07T20:32:42.5301708Z x1 = x1.contiguous() 2025-05-07T20:32:42.5302044Z 2025-05-07T20:32:42.5302309Z if scale_ub is not None: 2025-05-07T20:32:42.5302681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5303144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5303563Z ) 2025-05-07T20:32:42.5304118Z else: 2025-05-07T20:32:42.5304402Z scale_ub_tensor = None 2025-05-07T20:32:42.5304743Z 2025-05-07T20:32:42.5305050Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5305485Z op = silu_mul_quant 2025-05-07T20:32:42.5305865Z if compiled: 2025-05-07T20:32:42.5306189Z op = torch.compile(op) 2025-05-07T20:32:42.5306592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5306979Z 2025-05-07T20:32:42.5307239Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5307768Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5308164Z 2025-05-07T20:32:42.5308478Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5308932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5309337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5309769Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5310363Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5310794Z 2025-05-07T20:32:42.5311056Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5311313Z 2025-05-07T20:32:42.5311437Z moe/activation_test.py:126: 2025-05-07T20:32:42.5311823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5312269Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5312703Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5313773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5314810Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5315571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5316569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5317520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5318517Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5319576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5320608Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5321623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5322490Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5323311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5323994Z fn() 2025-05-07T20:32:42.5324674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5325570Z self.fn.run( 2025-05-07T20:32:42.5326238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5327039Z kernel = self.compile( 2025-05-07T20:32:42.5327784Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5328664Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5329216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5329630Z 2025-05-07T20:32:42.5329913Z self = 2025-05-07T20:32:42.5331441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5333406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fcc1d69d0>} 2025-05-07T20:32:42.5335296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5336755Z context = 2025-05-07T20:32:42.5337162Z 2025-05-07T20:32:42.5337444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5338172Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5338792Z module_map=module_map) 2025-05-07T20:32:42.5339257Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5339719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5340062Z E ^ 2025-05-07T20:32:42.5340674Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5341272Z 2025-05-07T20:32:42.5341820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5342503Z 2025-05-07T20:32:42.5342635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5343172Z self=, 2025-05-07T20:32:42.5343700Z T=2048, 2025-05-07T20:32:42.5343939Z D=5120, 2025-05-07T20:32:42.5344191Z scale_ub=1200.0, 2025-05-07T20:32:42.5344488Z contiguous=True, 2025-05-07T20:32:42.5344766Z compiled=False, 2025-05-07T20:32:42.5345029Z ) 2025-05-07T20:32:42.5345436Z self = 2025-05-07T20:32:42.5346122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5346487Z 2025-05-07T20:32:42.5346587Z @given( 2025-05-07T20:32:42.5346884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5347307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5347722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5348158Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5348586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5348970Z ) 2025-05-07T20:32:42.5349436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5350127Z def test_silu_mul_quant( 2025-05-07T20:32:42.5350450Z self, 2025-05-07T20:32:42.5350718Z T: int, 2025-05-07T20:32:42.5350977Z D: int, 2025-05-07T20:32:42.5351266Z scale_ub: Optional[float], 2025-05-07T20:32:42.5351638Z contiguous: bool, 2025-05-07T20:32:42.5352072Z compiled: bool, 2025-05-07T20:32:42.5352373Z ) -> None: 2025-05-07T20:32:42.5352671Z torch.manual_seed(2025) 2025-05-07T20:32:42.5353004Z 2025-05-07T20:32:42.5353369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5353899Z 2025-05-07T20:32:42.5354167Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5354564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5354985Z x = x_sign * x_clamp 2025-05-07T20:32:42.5355313Z x0 = x[:, :D] 
2025-05-07T20:32:42.5355588Z         x1 = x[:, D:]
2025-05-07T20:32:42.5355946Z 
2025-05-07T20:32:42.5356204Z         if contiguous:
2025-05-07T20:32:42.5356502Z             x0 = x0.contiguous()
2025-05-07T20:32:42.5356829Z             x1 = x1.contiguous()
2025-05-07T20:32:42.5357141Z 
2025-05-07T20:32:42.5357388Z         if scale_ub is not None:
2025-05-07T20:32:42.5357737Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.5358174Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.5358574Z             )
2025-05-07T20:32:42.5358811Z         else:
2025-05-07T20:32:42.5359080Z             scale_ub_tensor = None
2025-05-07T20:32:42.5359410Z 
2025-05-07T20:32:42.5359701Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5360109Z             op = silu_mul_quant
2025-05-07T20:32:42.5360435Z             if compiled:
2025-05-07T20:32:42.5360742Z                 op = torch.compile(op)
2025-05-07T20:32:42.5361134Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5361488Z 
2025-05-07T20:32:42.5361787Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:42.5362012Z 
2025-05-07T20:32:42.5362138Z moe/activation_test.py:117: 
2025-05-07T20:32:42.5362526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5362959Z moe/activation_test.py:115: in fn
2025-05-07T20:32:42.5363319Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5364257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:42.5365219Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:42.5365992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.5366934Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.5367826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.5368553Z     kernel = self.compile(
2025-05-07T20:32:42.5369290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.5370194Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.5370744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5371065Z 
2025-05-07T20:32:42.5371356Z self = <...>
2025-05-07T20:32:42.5372860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.5374785Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9fb83b05e0>}
2025-05-07T20:32:42.5376619Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:42.5378016Z context = <...>
2025-05-07T20:32:42.5378392Z 
2025-05-07T20:32:42.5378616Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.5379386Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.5380031Z                            module_map=module_map)
2025-05-07T20:32:42.5380575Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.5381050Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:42.5381401Z E       ^
2025-05-07T20:32:42.5382028Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.5382721Z 
2025-05-07T20:32:42.5390038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:42.5390883Z 
2025-05-07T20:32:42.5391044Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:42.5391627Z     self=<...>,
2025-05-07T20:32:42.5392186Z     T=2048,
2025-05-07T20:32:42.5392443Z     D=5120,
2025-05-07T20:32:42.5392689Z     scale_ub=1200.0,
2025-05-07T20:32:42.5392967Z     contiguous=True,
2025-05-07T20:32:42.5393256Z     compiled=True,
2025-05-07T20:32:42.5393512Z )
2025-05-07T20:32:42.5393930Z self = <...>
2025-05-07T20:32:42.5394607Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:42.5394984Z 
2025-05-07T20:32:42.5395088Z     @given(
2025-05-07T20:32:42.5395391Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:42.5396041Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:42.5396460Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:42.5396907Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:42.5397349Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:42.5397731Z     )
2025-05-07T20:32:42.5398205Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:42.5398821Z     def test_silu_mul_quant(
2025-05-07T20:32:42.5399145Z         self,
2025-05-07T20:32:42.5399411Z         T: int,
2025-05-07T20:32:42.5399680Z         D: int,
2025-05-07T20:32:42.5399969Z         scale_ub: Optional[float],
2025-05-07T20:32:42.5400351Z         contiguous: bool,
2025-05-07T20:32:42.5400678Z         compiled: bool,
2025-05-07T20:32:42.5400982Z     ) -> None:
2025-05-07T20:32:42.5401281Z         torch.manual_seed(2025)
2025-05-07T20:32:42.5401547Z 
2025-05-07T20:32:42.5401820Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:42.5402164Z 
2025-05-07T20:32:42.5402352Z         x_sign = torch.sign(x)
2025-05-07T20:32:42.5402644Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:42.5402946Z         x = x_sign * x_clamp
2025-05-07T20:32:42.5403186Z         x0 = x[:, :D]
2025-05-07T20:32:42.5403402Z         x1 = x[:, D:]
2025-05-07T20:32:42.5403604Z 
2025-05-07T20:32:42.5404107Z         if contiguous:
2025-05-07T20:32:42.5404348Z             x0 = x0.contiguous()
2025-05-07T20:32:42.5404601Z             x1 = x1.contiguous()
2025-05-07T20:32:42.5404841Z 
2025-05-07T20:32:42.5405029Z         if scale_ub is not None:
2025-05-07T20:32:42.5405295Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:42.5405635Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:42.5405945Z             )
2025-05-07T20:32:42.5406131Z         else:
2025-05-07T20:32:42.5406340Z             scale_ub_tensor = None
2025-05-07T20:32:42.5406590Z 
2025-05-07T20:32:42.5406821Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5407129Z             op = silu_mul_quant
2025-05-07T20:32:42.5407378Z             if compiled:
2025-05-07T20:32:42.5407624Z                 op = torch.compile(op)
2025-05-07T20:32:42.5407915Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:42.5408189Z 
2025-05-07T20:32:42.5408376Z         y_fp8, y_scale = fn()
2025-05-07T20:32:42.5408834Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:42.5409127Z 
2025-05-07T20:32:42.5409360Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:42.5409684Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:42.5410067Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:42.5410382Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:42.5410739Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.5411056Z 
2025-05-07T20:32:42.5411253Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:42.5411529Z 
2025-05-07T20:32:42.5411633Z moe/activation_test.py:126: 
2025-05-07T20:32:42.5411925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5412259Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:42.5412587Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:42.5413371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:42.5414127Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:42.5414677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:42.5415365Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:42.5416092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:42.5416881Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.5417628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:42.5418369Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:42.5419092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:42.5419725Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:42.5420322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:42.5420833Z     fn()
2025-05-07T20:32:42.5421332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:42.5421907Z     self.fn.run(
2025-05-07T20:32:42.5422371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:42.5422894Z     kernel = self.compile(
2025-05-07T20:32:42.5423427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:42.5424075Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:42.5424468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5424700Z 
2025-05-07T20:32:42.5424905Z self = <...>
2025-05-07T20:32:42.5425997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:42.5427402Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9fcac53550>}
2025-05-07T20:32:42.5428753Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:42.5429773Z context = <...>
2025-05-07T20:32:42.5430201Z 
2025-05-07T20:32:42.5430364Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:42.5430884Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:42.5442542Z                            module_map=module_map)
2025-05-07T20:32:42.5442923Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:42.5443283Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:42.5443549Z E       ^
2025-05-07T20:32:42.5444016Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:42.5444516Z 
2025-05-07T20:32:42.5444937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
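
Both failing kernels share one root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend accepts it only on compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner reports capability 8.6, where only 'fp8e4b15' and 'fp8e5' exist, so every Hypothesis example below fails identically at compile time. A minimal sketch of a capability gate such a test could use to skip these cases on unsupported GPUs (hypothetical helper, not FBGEMM's actual guard):

# Hedged sketch (not part of FBGEMM): skip FP8 e4m3 tests on pre-sm_89 GPUs,
# where Triton rejects fp8e4nv with the ValueError shown above.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    """True if Triton's fp8e4nv (torch.float8_e4m3fn) compiles on this GPU."""
    if not torch.cuda.is_available():
        return False
    # Ada/Hopper and newer report (8, 9) or higher; the A10G reports (8, 6).
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test above:
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+")
# def test_silu_mul_quant(self, ...) -> None: ...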
2025-05-07T20:32:42.5445551Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same test body as the T=2048 listing above; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5476285Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same test body; ref_fn() fails compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5516606Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
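
For reference, both kernels implement the same computation the test's ref_fn spells out: SiLU(x0) * x1 followed by row-wise FP8 quantization. A Triton-free PyTorch sketch of that math (the row-wise scaling scheme below is an illustrative assumption, not FBGEMM's exact kernel contract; requires torch >= 2.1 for torch.float8_e4m3fn):

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, mirroring ref_fn in the listing above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # One scale per row; optionally cap the row max with scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Dequantize as the test does: y ~= y_fp8.to(torch.float32) * scale[:, None]
    return y_fp8, scale.squeeze(-1)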
2025-05-07T20:32:42.5547560Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same test body; fn() fails compiling _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:42.5577842Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... same test body; ref_fn() fails compiling _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
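
The ValueError itself names the only FP8 formats this architecture accepts: 'fp8e4b15' and 'fp8e5' (float8_e5m2). A hedged fallback sketch would pick the dtype by capability; this is illustrative only, since FBGEMM's kernels are written against e4m3 and would need matching changes for comparable numerics:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 ('fp8e4nv') needs sm_89+; e5m2 ('fp8e5') also works on sm_80/sm_86,
    # trading two mantissa bits for compatibility.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2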
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5601973Z 2025-05-07T20:32:42.5602068Z moe/activation_test.py:126: 2025-05-07T20:32:42.5602363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5602764Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5603090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5604251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5605012Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5605555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5606282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5607029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5607738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5608488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5609231Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5609953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5610582Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5611180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5611690Z fn() 2025-05-07T20:32:42.5612256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5612834Z self.fn.run( 2025-05-07T20:32:42.5613286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5613812Z kernel = self.compile( 2025-05-07T20:32:42.5614343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5614986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5615383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5615609Z 2025-05-07T20:32:42.5615833Z self = 2025-05-07T20:32:42.5616949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5618322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca2f73a0>} 2025-05-07T20:32:42.5619666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5620695Z context = 2025-05-07T20:32:42.5620981Z 2025-05-07T20:32:42.5621151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5621674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5622134Z module_map=module_map) 2025-05-07T20:32:42.5622495Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5622848Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5623099Z E ^ 2025-05-07T20:32:42.5623550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5623998Z 2025-05-07T20:32:42.5624413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5624992Z 2025-05-07T20:32:42.5625094Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5625496Z self=, 2025-05-07T20:32:42.5625953Z T=128, 2025-05-07T20:32:42.5626160Z D=7168, 2025-05-07T20:32:42.5626337Z scale_ub=None, 2025-05-07T20:32:42.5626548Z contiguous=False, 2025-05-07T20:32:42.5626768Z compiled=False, 2025-05-07T20:32:42.5626959Z ) 2025-05-07T20:32:42.5627267Z self = 2025-05-07T20:32:42.5627795Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5628060Z 2025-05-07T20:32:42.5628137Z @given( 2025-05-07T20:32:42.5628351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5628656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5628953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5629277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5629599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5629936Z ) 2025-05-07T20:32:42.5630277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5630709Z def test_silu_mul_quant( 2025-05-07T20:32:42.5630940Z self, 2025-05-07T20:32:42.5631119Z T: int, 2025-05-07T20:32:42.5631306Z D: int, 2025-05-07T20:32:42.5631518Z scale_ub: Optional[float], 2025-05-07T20:32:42.5631772Z contiguous: bool, 2025-05-07T20:32:42.5632056Z compiled: bool, 2025-05-07T20:32:42.5632271Z ) -> None: 2025-05-07T20:32:42.5632475Z torch.manual_seed(2025) 2025-05-07T20:32:42.5632705Z 2025-05-07T20:32:42.5632968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5633301Z 2025-05-07T20:32:42.5633479Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5633762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5634064Z x = x_sign * x_clamp 2025-05-07T20:32:42.5634293Z x0 = x[:, :D] 2025-05-07T20:32:42.5634499Z x1 = x[:, D:] 2025-05-07T20:32:42.5634698Z 2025-05-07T20:32:42.5634870Z if contiguous: 2025-05-07T20:32:42.5635088Z x0 = x0.contiguous() 2025-05-07T20:32:42.5635338Z x1 = x1.contiguous() 2025-05-07T20:32:42.5635568Z 2025-05-07T20:32:42.5635773Z if scale_ub is not None: 2025-05-07T20:32:42.5636058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5636387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5636686Z ) 2025-05-07T20:32:42.5636868Z else: 2025-05-07T20:32:42.5637063Z scale_ub_tensor = None 2025-05-07T20:32:42.5637306Z 2025-05-07T20:32:42.5637526Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5637833Z op = silu_mul_quant 2025-05-07T20:32:42.5638073Z if compiled: 
2025-05-07T20:32:42.5638310Z op = torch.compile(op) 2025-05-07T20:32:42.5638599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5638864Z 2025-05-07T20:32:42.5639051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5639212Z 2025-05-07T20:32:42.5639311Z moe/activation_test.py:117: 2025-05-07T20:32:42.5639591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5639915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5640185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5640872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5641555Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5642087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5642822Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5643468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5643988Z kernel = self.compile( 2025-05-07T20:32:42.5644563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5645207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5645595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5645904Z 2025-05-07T20:32:42.5646106Z self = 2025-05-07T20:32:42.5647185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5648564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca2dfd30>} 2025-05-07T20:32:42.5649910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5650931Z context = 2025-05-07T20:32:42.5651224Z 2025-05-07T20:32:42.5651459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5651980Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5652432Z module_map=module_map) 2025-05-07T20:32:42.5652790Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5653136Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5653385Z E ^ 2025-05-07T20:32:42.5653837Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5654288Z 2025-05-07T20:32:42.5654702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5655209Z 2025-05-07T20:32:42.5655315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5655714Z self=, 2025-05-07T20:32:42.5656152Z T=4096, 2025-05-07T20:32:42.5656335Z D=5120, 2025-05-07T20:32:42.5656512Z scale_ub=1200.0, 2025-05-07T20:32:42.5656719Z contiguous=True, 2025-05-07T20:32:42.5656936Z compiled=False, 2025-05-07T20:32:42.5657134Z ) 2025-05-07T20:32:42.5657437Z self = 2025-05-07T20:32:42.5657922Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.5658193Z 2025-05-07T20:32:42.5658270Z @given( 2025-05-07T20:32:42.5658485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5658790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5659090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5659403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5659723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5660002Z ) 2025-05-07T20:32:42.5660343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5660771Z def test_silu_mul_quant( 2025-05-07T20:32:42.5661006Z self, 2025-05-07T20:32:42.5661187Z T: int, 2025-05-07T20:32:42.5661370Z D: int, 2025-05-07T20:32:42.5661576Z scale_ub: Optional[float], 2025-05-07T20:32:42.5661834Z contiguous: bool, 2025-05-07T20:32:42.5662057Z compiled: bool, 2025-05-07T20:32:42.5662320Z ) -> None: 2025-05-07T20:32:42.5662522Z torch.manual_seed(2025) 2025-05-07T20:32:42.5662757Z 2025-05-07T20:32:42.5663018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5663347Z 2025-05-07T20:32:42.5663568Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5663847Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5664146Z x = x_sign * x_clamp 2025-05-07T20:32:42.5664380Z x0 = x[:, :D] 2025-05-07T20:32:42.5664581Z x1 = x[:, D:] 2025-05-07T20:32:42.5664775Z 2025-05-07T20:32:42.5664992Z if contiguous: 2025-05-07T20:32:42.5665211Z x0 = x0.contiguous() 2025-05-07T20:32:42.5665454Z x1 = x1.contiguous() 2025-05-07T20:32:42.5665684Z 2025-05-07T20:32:42.5665862Z if scale_ub is not None: 2025-05-07T20:32:42.5666118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5666440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5666742Z ) 2025-05-07T20:32:42.5666915Z else: 2025-05-07T20:32:42.5667114Z scale_ub_tensor = None 2025-05-07T20:32:42.5667354Z 2025-05-07T20:32:42.5667570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5667877Z op = silu_mul_quant 2025-05-07T20:32:42.5668113Z if compiled: 2025-05-07T20:32:42.5668343Z op = torch.compile(op) 2025-05-07T20:32:42.5668629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5668890Z 2025-05-07T20:32:42.5669110Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5669284Z 2025-05-07T20:32:42.5669378Z moe/activation_test.py:117: 2025-05-07T20:32:42.5669659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5670021Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5670287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5670963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5671645Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5672171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5672838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5673488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5674005Z kernel = self.compile( 2025-05-07T20:32:42.5674384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5674554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5674676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5674681Z 2025-05-07T20:32:42.5674884Z self = 2025-05-07T20:32:42.5675665Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5676220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fca3121f0>} 2025-05-07T20:32:42.5676969Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5677157Z context = 2025-05-07T20:32:42.5677162Z 2025-05-07T20:32:42.5677322Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5677633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5677734Z module_map=module_map) 2025-05-07T20:32:42.5677931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5678031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5678101Z E ^ 2025-05-07T20:32:42.5678468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5678474Z 2025-05-07T20:32:42.5678925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5678930Z 2025-05-07T20:32:42.5679026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5679247Z self=, 2025-05-07T20:32:42.5679317Z T=1, 2025-05-07T20:32:42.5679390Z D=5120, 2025-05-07T20:32:42.5679473Z scale_ub=None, 2025-05-07T20:32:42.5679552Z contiguous=True, 2025-05-07T20:32:42.5679632Z compiled=True, 2025-05-07T20:32:42.5679699Z ) 2025-05-07T20:32:42.5679917Z self = 2025-05-07T20:32:42.5680082Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5680087Z 2025-05-07T20:32:42.5680161Z @given( 2025-05-07T20:32:42.5680275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5680372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5680526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5680644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5680751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5680819Z ) 2025-05-07T20:32:42.5681064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5681157Z def test_silu_mul_quant( 2025-05-07T20:32:42.5681229Z self, 2025-05-07T20:32:42.5681307Z T: int, 2025-05-07T20:32:42.5681379Z D: int, 2025-05-07T20:32:42.5681474Z scale_ub: Optional[float], 2025-05-07T20:32:42.5681561Z contiguous: bool, 2025-05-07T20:32:42.5681644Z compiled: bool, 2025-05-07T20:32:42.5681714Z ) -> None: 2025-05-07T20:32:42.5681807Z torch.manual_seed(2025) 2025-05-07T20:32:42.5681874Z 2025-05-07T20:32:42.5682036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5682107Z 2025-05-07T20:32:42.5682199Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5682323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5682407Z x = x_sign * x_clamp 2025-05-07T20:32:42.5682480Z x0 = x[:, :D] 2025-05-07T20:32:42.5682555Z x1 = x[:, D:] 2025-05-07T20:32:42.5682624Z 2025-05-07T20:32:42.5682703Z if contiguous: 2025-05-07T20:32:42.5682797Z x0 = x0.contiguous() 2025-05-07T20:32:42.5682881Z x1 = x1.contiguous() 2025-05-07T20:32:42.5682948Z 2025-05-07T20:32:42.5683038Z if scale_ub is not None: 2025-05-07T20:32:42.5683137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5683273Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5683346Z ) 2025-05-07T20:32:42.5683419Z else: 2025-05-07T20:32:42.5683509Z scale_ub_tensor = None 2025-05-07T20:32:42.5683575Z 2025-05-07T20:32:42.5683698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5683789Z op = silu_mul_quant 2025-05-07T20:32:42.5683868Z if compiled: 2025-05-07T20:32:42.5683963Z op = torch.compile(op) 2025-05-07T20:32:42.5684065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5684131Z 2025-05-07T20:32:42.5684215Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5684335Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5684450Z 2025-05-07T20:32:42.5684582Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5684681Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5684815Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5684935Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5685073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5685141Z 2025-05-07T20:32:42.5685237Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5685242Z 2025-05-07T20:32:42.5685379Z moe/activation_test.py:126: 2025-05-07T20:32:42.5685503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5685606Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5685735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5686318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5686430Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5686812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5687034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5687394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5687688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5688090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5688339Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5688708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5688872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5689211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5689286Z fn() 2025-05-07T20:32:42.5689679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5689759Z self.fn.run( 2025-05-07T20:32:42.5690091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5690181Z kernel = self.compile( 2025-05-07T20:32:42.5690560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5690730Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5690850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5690856Z 2025-05-07T20:32:42.5691062Z self = 2025-05-07T20:32:42.5691840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5692348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fca3124c0>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ... at 0x...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7f9fca00cf70>}
module_map = {'triton.language.extra.libdevice': <module 'triton.language.extra.libdevice' from '...'>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
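Every one of these failures is the same architecture gate in Triton rather than a numerical bug in the test: the fp8e4nv type (torch.float8_e4m3fn) only compiles on NVIDIA GPUs with compute capability 8.9 or newer, and the supported-dtype list in the error, ('fp8e4b15', 'fp8e5'), is what Triton reports for pre-sm_89 parts such as the sm_86 A10G. A minimal sketch of a capability guard that a test like this could use to skip cleanly on such devices follows; the helper name and the skipUnless wiring are illustrative, not the actual FBGEMM test code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs an NVIDIA GPU
    # with compute capability >= 8.9; sm_86 parts such as the A10G only
    # support the 'fp8e4b15' and 'fp8e5' encodings.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test above:
#
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
# def test_silu_mul_quant(self, ...) -> None:
#     ...

The error should also be reproducible outside FBGEMM by any Triton kernel that casts to fp8e4nv on such a device; a self-contained sketch under the same assumptions (kernel name, sizes, and grid are illustrative):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    # The store through a float8_e4m3fn pointer forces the fp8e4nv
    # conversion that Triton rejects at compile time on pre-sm_89 GPUs.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.float32)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# On sm_86 this raises the same CompilationError as in the log above;
# on sm_89+ it completes normally.
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)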
2025-05-07T20:32:42.5716512Z op = silu_mul_quant 2025-05-07T20:32:42.5716590Z if compiled: 2025-05-07T20:32:42.5716688Z op = torch.compile(op) 2025-05-07T20:32:42.5716837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5716905Z 2025-05-07T20:32:42.5716994Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5717109Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5717179Z 2025-05-07T20:32:42.5717350Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5717448Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5717540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5717661Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5717836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5717906Z 2025-05-07T20:32:42.5718001Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5718005Z 2025-05-07T20:32:42.5718095Z moe/activation_test.py:126: 2025-05-07T20:32:42.5718220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5718321Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5718452Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5719014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5719109Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5719469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5719683Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5720081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5720337Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5720727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5720981Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5721349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5721513Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5721849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5721918Z fn() 2025-05-07T20:32:42.5722312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5722394Z self.fn.run( 2025-05-07T20:32:42.5722724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5722815Z kernel = self.compile( 2025-05-07T20:32:42.5723187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5723361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5723485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5723489Z 2025-05-07T20:32:42.5723696Z self = 2025-05-07T20:32:42.5724481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5724990Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9e93b80>} 2025-05-07T20:32:42.5725727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5725961Z context = 2025-05-07T20:32:42.5725965Z 2025-05-07T20:32:42.5726190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5726455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5726558Z module_map=module_map) 2025-05-07T20:32:42.5726717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5726862Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5726936Z E ^ 2025-05-07T20:32:42.5727286Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5727294Z 2025-05-07T20:32:42.5727701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5727708Z 2025-05-07T20:32:42.5727808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5728033Z self=, 2025-05-07T20:32:42.5728103Z T=4096, 2025-05-07T20:32:42.5728178Z D=5120, 2025-05-07T20:32:42.5728255Z scale_ub=None, 2025-05-07T20:32:42.5728334Z contiguous=True, 2025-05-07T20:32:42.5728416Z compiled=True, 2025-05-07T20:32:42.5728484Z ) 2025-05-07T20:32:42.5728700Z self = 2025-05-07T20:32:42.5728909Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5728914Z 2025-05-07T20:32:42.5728988Z @given( 2025-05-07T20:32:42.5729102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5729199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5729307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5729425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5729626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5729809Z ) 2025-05-07T20:32:42.5736028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5736165Z def test_silu_mul_quant( 2025-05-07T20:32:42.5736244Z self, 2025-05-07T20:32:42.5736338Z T: int, 2025-05-07T20:32:42.5736408Z D: int, 2025-05-07T20:32:42.5736502Z scale_ub: Optional[float], 2025-05-07T20:32:42.5736586Z contiguous: bool, 2025-05-07T20:32:42.5736672Z compiled: bool, 2025-05-07T20:32:42.5736749Z ) -> None: 2025-05-07T20:32:42.5736837Z torch.manual_seed(2025) 2025-05-07T20:32:42.5736905Z 2025-05-07T20:32:42.5737076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5737145Z 2025-05-07T20:32:42.5737232Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5737359Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5737445Z x = x_sign * x_clamp 2025-05-07T20:32:42.5737520Z x0 = x[:, :D] 2025-05-07T20:32:42.5737597Z x1 = x[:, D:] 2025-05-07T20:32:42.5737664Z 2025-05-07T20:32:42.5737746Z if contiguous: 2025-05-07T20:32:42.5737835Z x0 = x0.contiguous() 2025-05-07T20:32:42.5737920Z x1 = x1.contiguous() 2025-05-07T20:32:42.5737990Z 2025-05-07T20:32:42.5738075Z if scale_ub is not None: 2025-05-07T20:32:42.5738172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5738309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5738381Z ) 2025-05-07T20:32:42.5738454Z else: 2025-05-07T20:32:42.5738545Z scale_ub_tensor 
= None 2025-05-07T20:32:42.5738612Z 2025-05-07T20:32:42.5738737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5738825Z op = silu_mul_quant 2025-05-07T20:32:42.5738970Z if compiled: 2025-05-07T20:32:42.5739069Z op = torch.compile(op) 2025-05-07T20:32:42.5739168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5739236Z 2025-05-07T20:32:42.5739323Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5739484Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5739552Z 2025-05-07T20:32:42.5739688Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5739783Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5739875Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5740042Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5740178Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5740249Z 2025-05-07T20:32:42.5740348Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5740353Z 2025-05-07T20:32:42.5740450Z moe/activation_test.py:126: 2025-05-07T20:32:42.5740583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5740683Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5740813Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5741376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5741472Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5741829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5742093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5742456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5742710Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5743103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5743350Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5743722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5743885Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5744225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5744306Z fn() 2025-05-07T20:32:42.5744700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5744781Z self.fn.run( 2025-05-07T20:32:42.5745109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5745199Z kernel = self.compile( 2025-05-07T20:32:42.5745577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5745749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5745878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5745883Z 2025-05-07T20:32:42.5746085Z self = 2025-05-07T20:32:42.5746864Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5747376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99e0c10>} 2025-05-07T20:32:42.5748113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5748352Z context = 2025-05-07T20:32:42.5748395Z 2025-05-07T20:32:42.5748555Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5748816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5748917Z module_map=module_map) 2025-05-07T20:32:42.5749118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5749216Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5749288Z E ^ 2025-05-07T20:32:42.5749637Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5749642Z 2025-05-07T20:32:42.5750135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5750140Z 2025-05-07T20:32:42.5750237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5750460Z self=, 2025-05-07T20:32:42.5750533Z T=16384, 2025-05-07T20:32:42.5750601Z D=5120, 2025-05-07T20:32:42.5750679Z scale_ub=None, 2025-05-07T20:32:42.5750759Z contiguous=True, 2025-05-07T20:32:42.5750837Z compiled=True, 2025-05-07T20:32:42.5750911Z ) 2025-05-07T20:32:42.5751194Z self = 2025-05-07T20:32:42.5751366Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.5751371Z 2025-05-07T20:32:42.5751447Z @given( 2025-05-07T20:32:42.5751560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5751656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5751773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5751883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5751993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5752064Z ) 2025-05-07T20:32:42.5752305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5752397Z def test_silu_mul_quant( 2025-05-07T20:32:42.5752469Z self, 2025-05-07T20:32:42.5752540Z T: int, 2025-05-07T20:32:42.5752614Z D: int, 2025-05-07T20:32:42.5752711Z scale_ub: Optional[float], 2025-05-07T20:32:42.5752798Z contiguous: bool, 2025-05-07T20:32:42.5752883Z compiled: bool, 2025-05-07T20:32:42.5752957Z ) -> None: 2025-05-07T20:32:42.5753046Z torch.manual_seed(2025) 2025-05-07T20:32:42.5753115Z 2025-05-07T20:32:42.5753278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5753350Z 2025-05-07T20:32:42.5753444Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5753565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5753651Z x = x_sign * x_clamp 2025-05-07T20:32:42.5753725Z x0 = x[:, :D] 2025-05-07T20:32:42.5753800Z x1 = x[:, D:] 2025-05-07T20:32:42.5753873Z 2025-05-07T20:32:42.5753950Z if contiguous: 2025-05-07T20:32:42.5754037Z x0 = x0.contiguous() 2025-05-07T20:32:42.5754124Z x1 = x1.contiguous() 2025-05-07T20:32:42.5754190Z 2025-05-07T20:32:42.5754274Z if scale_ub is not None: 2025-05-07T20:32:42.5754381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5754512Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:42.5754585Z ) 2025-05-07T20:32:42.5754659Z else: 2025-05-07T20:32:42.5754750Z scale_ub_tensor = None 2025-05-07T20:32:42.5754821Z 2025-05-07T20:32:42.5754949Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5755083Z op = silu_mul_quant 2025-05-07T20:32:42.5755165Z if compiled: 2025-05-07T20:32:42.5755260Z op = torch.compile(op) 2025-05-07T20:32:42.5755360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5755469Z 2025-05-07T20:32:42.5755557Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5755676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5755749Z 2025-05-07T20:32:42.5755880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5755978Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5756118Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5756237Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5756375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5756443Z 2025-05-07T20:32:42.5756537Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.5756544Z 2025-05-07T20:32:42.5756640Z moe/activation_test.py:126: 2025-05-07T20:32:42.5756764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5756863Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5756998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5757552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5757648Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5758044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5758266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5758626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5758878Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5759274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5759525Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5759893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5760058Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5760402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5760471Z fn() 2025-05-07T20:32:42.5760870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5760950Z self.fn.run( 2025-05-07T20:32:42.5761282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5761376Z kernel = self.compile( 2025-05-07T20:32:42.5761750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5761930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5762050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:42.5762054Z 2025-05-07T20:32:42.5762257Z self = 2025-05-07T20:32:42.5763042Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5763547Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc99b0c10>} 2025-05-07T20:32:42.5764370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5764557Z context = 2025-05-07T20:32:42.5764562Z 2025-05-07T20:32:42.5764726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5764984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5765125Z module_map=module_map) 2025-05-07T20:32:42.5765286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5765380Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5765450Z E ^ 2025-05-07T20:32:42.5765803Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5765810Z 2025-05-07T20:32:42.5766242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5766247Z 2025-05-07T20:32:42.5766362Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5766588Z self=, 2025-05-07T20:32:42.5766660Z T=1, 2025-05-07T20:32:42.5766733Z D=5120, 2025-05-07T20:32:42.5766810Z scale_ub=1200.0, 2025-05-07T20:32:42.5766889Z contiguous=True, 2025-05-07T20:32:42.5767013Z compiled=True, 2025-05-07T20:32:42.5767084Z ) 2025-05-07T20:32:42.5767306Z self = 2025-05-07T20:32:42.5767466Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.5767471Z 2025-05-07T20:32:42.5767541Z @given( 2025-05-07T20:32:42.5767660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5767758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5767868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5767982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5768098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5768165Z ) 2025-05-07T20:32:42.5768409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5768499Z def test_silu_mul_quant( 2025-05-07T20:32:42.5768569Z self, 2025-05-07T20:32:42.5768641Z T: int, 2025-05-07T20:32:42.5768714Z D: int, 2025-05-07T20:32:42.5768811Z scale_ub: Optional[float], 2025-05-07T20:32:42.5768897Z contiguous: bool, 2025-05-07T20:32:42.5768976Z compiled: bool, 2025-05-07T20:32:42.5769053Z ) -> None: 2025-05-07T20:32:42.5769141Z torch.manual_seed(2025) 2025-05-07T20:32:42.5769208Z 2025-05-07T20:32:42.5769375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5769444Z 2025-05-07T20:32:42.5769529Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5769651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5769738Z x = x_sign * x_clamp 2025-05-07T20:32:42.5769815Z x0 = x[:, :D] 2025-05-07T20:32:42.5769887Z x1 = x[:, D:] 2025-05-07T20:32:42.5769954Z 2025-05-07T20:32:42.5770033Z if contiguous: 2025-05-07T20:32:42.5770120Z x0 = x0.contiguous() 2025-05-07T20:32:42.5770205Z x1 = x1.contiguous() 2025-05-07T20:32:42.5770282Z 2025-05-07T20:32:42.5770367Z if scale_ub is not None: 2025-05-07T20:32:42.5770464Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:32:42.5770596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5770668Z ) 2025-05-07T20:32:42.5770736Z else: 2025-05-07T20:32:42.5770825Z scale_ub_tensor = None 2025-05-07T20:32:42.5770943Z 2025-05-07T20:32:42.5771069Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5771152Z op = silu_mul_quant 2025-05-07T20:32:42.5771229Z if compiled: 2025-05-07T20:32:42.5771367Z op = torch.compile(op) 2025-05-07T20:32:42.5771472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5771535Z 2025-05-07T20:32:42.5771625Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5771629Z 2025-05-07T20:32:42.5771721Z moe/activation_test.py:117: 2025-05-07T20:32:42.5771848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5771984Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5772079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5772447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5772533Z return fn(*args, **kwargs) 2025-05-07T20:32:42.5773024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5773124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5773475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5773693Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5774028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5774161Z kernel = self.compile( 2025-05-07T20:32:42.5774542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5774710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5774829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5774837Z 2025-05-07T20:32:42.5775040Z self = 2025-05-07T20:32:42.5775819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5776325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9256670>} 2025-05-07T20:32:42.5777066Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5777253Z context = 2025-05-07T20:32:42.5777261Z 2025-05-07T20:32:42.5777418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5777676Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5777783Z module_map=module_map) 2025-05-07T20:32:42.5777943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5778035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5778108Z E ^ 2025-05-07T20:32:42.5778458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5778466Z 2025-05-07T20:32:42.5778878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5778882Z 2025-05-07T20:32:42.5778979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5779194Z self=, 2025-05-07T20:32:42.5779268Z T=1, 2025-05-07T20:32:42.5779385Z D=5120, 2025-05-07T20:32:42.5779462Z scale_ub=None, 2025-05-07T20:32:42.5779546Z contiguous=False, 2025-05-07T20:32:42.5779623Z compiled=True, 2025-05-07T20:32:42.5779692Z ) 2025-05-07T20:32:42.5779946Z self = 2025-05-07T20:32:42.5780107Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.5780111Z 2025-05-07T20:32:42.5780186Z @given( 2025-05-07T20:32:42.5780299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5780394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5780573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5780686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5780794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5780868Z ) 2025-05-07T20:32:42.5781108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5781202Z def test_silu_mul_quant( 2025-05-07T20:32:42.5781273Z self, 2025-05-07T20:32:42.5781346Z T: int, 2025-05-07T20:32:42.5781421Z D: int, 2025-05-07T20:32:42.5781517Z scale_ub: Optional[float], 2025-05-07T20:32:42.5781602Z contiguous: bool, 2025-05-07T20:32:42.5781687Z compiled: bool, 2025-05-07T20:32:42.5781762Z ) -> None: 2025-05-07T20:32:42.5781852Z torch.manual_seed(2025) 2025-05-07T20:32:42.5781922Z 2025-05-07T20:32:42.5782084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5782157Z 2025-05-07T20:32:42.5782288Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5782408Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5782492Z x = x_sign * x_clamp 2025-05-07T20:32:42.5782567Z x0 = x[:, :D] 2025-05-07T20:32:42.5782641Z x1 = x[:, D:] 2025-05-07T20:32:42.5782712Z 2025-05-07T20:32:42.5782788Z if contiguous: 2025-05-07T20:32:42.5782879Z x0 = x0.contiguous() 2025-05-07T20:32:42.5782964Z x1 = x1.contiguous() 2025-05-07T20:32:42.5783031Z 2025-05-07T20:32:42.5783118Z if scale_ub is not None: 2025-05-07T20:32:42.5783223Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5783354Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5783423Z ) 2025-05-07T20:32:42.5783496Z else: 2025-05-07T20:32:42.5783585Z scale_ub_tensor = None 2025-05-07T20:32:42.5783652Z 2025-05-07T20:32:42.5783781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5783873Z op = silu_mul_quant 2025-05-07T20:32:42.5783958Z if compiled: 2025-05-07T20:32:42.5784052Z op = torch.compile(op) 2025-05-07T20:32:42.5784152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5784222Z 2025-05-07T20:32:42.5784310Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.5784429Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.5784499Z 2025-05-07T20:32:42.5784629Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5784726Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.5784826Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.5784948Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.5785080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5785153Z 2025-05-07T20:32:42.5785249Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:42.5785258Z 2025-05-07T20:32:42.5785355Z moe/activation_test.py:126: 2025-05-07T20:32:42.5785477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5785575Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.5785705Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.5786258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.5786401Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.5786794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5787011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5787371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.5787661Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5788053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:42.5788304Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.5788673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.5788837Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.5789173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.5789255Z fn() 2025-05-07T20:32:42.5789648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.5789724Z self.fn.run( 2025-05-07T20:32:42.5790148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5790242Z kernel = self.compile( 2025-05-07T20:32:42.5790615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5790784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5790910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5790914Z 2025-05-07T20:32:42.5791124Z self = 2025-05-07T20:32:42.5791907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5792416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9fc92c0dc0>} 2025-05-07T20:32:42.5793159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5793344Z context = 2025-05-07T20:32:42.5793354Z 2025-05-07T20:32:42.5793515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5793773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5793881Z module_map=module_map) 2025-05-07T20:32:42.5794039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5794133Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5794205Z E ^ 2025-05-07T20:32:42.5794559Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5794566Z 2025-05-07T20:32:42.5794975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5794979Z 2025-05-07T20:32:42.5795076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5795293Z self=, 2025-05-07T20:32:42.5795414Z T=1, 2025-05-07T20:32:42.5795486Z D=5120, 2025-05-07T20:32:42.5795563Z scale_ub=None, 2025-05-07T20:32:42.5795646Z contiguous=True, 2025-05-07T20:32:42.5795725Z compiled=False, 2025-05-07T20:32:42.5795833Z ) 2025-05-07T20:32:42.5796078Z self = 2025-05-07T20:32:42.5796261Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.5796265Z 2025-05-07T20:32:42.5796342Z @given( 2025-05-07T20:32:42.5796459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5796591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5796703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5796814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5796921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5796993Z ) 2025-05-07T20:32:42.5797237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5797325Z def test_silu_mul_quant( 2025-05-07T20:32:42.5797399Z self, 2025-05-07T20:32:42.5797469Z T: int, 2025-05-07T20:32:42.5797539Z D: int, 2025-05-07T20:32:42.5797636Z scale_ub: Optional[float], 2025-05-07T20:32:42.5797719Z contiguous: bool, 2025-05-07T20:32:42.5797803Z compiled: bool, 2025-05-07T20:32:42.5797876Z ) -> None: 2025-05-07T20:32:42.5797963Z torch.manual_seed(2025) 2025-05-07T20:32:42.5798034Z 2025-05-07T20:32:42.5798240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5798311Z 2025-05-07T20:32:42.5798401Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5798519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5798607Z x = x_sign * x_clamp 2025-05-07T20:32:42.5798686Z x0 = x[:, :D] 2025-05-07T20:32:42.5798764Z x1 = x[:, D:] 2025-05-07T20:32:42.5798833Z 2025-05-07T20:32:42.5798911Z if contiguous: 2025-05-07T20:32:42.5798997Z x0 = x0.contiguous() 2025-05-07T20:32:42.5799082Z x1 = x1.contiguous() 2025-05-07T20:32:42.5799150Z 2025-05-07T20:32:42.5799241Z if scale_ub is not None: 2025-05-07T20:32:42.5799343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5799471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5799544Z ) 2025-05-07T20:32:42.5799618Z else: 2025-05-07T20:32:42.5799706Z scale_ub_tensor = None 2025-05-07T20:32:42.5799775Z 2025-05-07T20:32:42.5799904Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5799990Z op = silu_mul_quant 2025-05-07T20:32:42.5800068Z if compiled: 2025-05-07T20:32:42.5800165Z op 
= torch.compile(op) 2025-05-07T20:32:42.5800265Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5800336Z 2025-05-07T20:32:42.5800420Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5800425Z 2025-05-07T20:32:42.5800516Z moe/activation_test.py:117: 2025-05-07T20:32:42.5800639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5800738Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5800832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5801332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5801427Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5801786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5802003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5802336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5802477Z kernel = self.compile( 2025-05-07T20:32:42.5802852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5803061Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5803188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5803192Z 2025-05-07T20:32:42.5803393Z self = 2025-05-07T20:32:42.5804459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5805061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc926edc0>} 2025-05-07T20:32:42.5805809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5806000Z context = 2025-05-07T20:32:42.5806006Z 2025-05-07T20:32:42.5806168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5806429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5806595Z module_map=module_map) 2025-05-07T20:32:42.5806758Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5806851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5806923Z E ^ 2025-05-07T20:32:42.5807279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5807286Z 2025-05-07T20:32:42.5807695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5807700Z 2025-05-07T20:32:42.5807798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5808021Z self=, 2025-05-07T20:32:42.5808091Z T=128, 2025-05-07T20:32:42.5808167Z D=5120, 2025-05-07T20:32:42.5808243Z scale_ub=None, 2025-05-07T20:32:42.5808325Z contiguous=False, 2025-05-07T20:32:42.5808404Z compiled=True, 2025-05-07T20:32:42.5808476Z ) 2025-05-07T20:32:42.5808692Z self = 2025-05-07T20:32:42.5808864Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.5808869Z 2025-05-07T20:32:42.5808944Z @given( 2025-05-07T20:32:42.5809058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5809156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5809265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5809382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5809490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5809562Z ) 2025-05-07T20:32:42.5809804Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5809893Z def test_silu_mul_quant( 2025-05-07T20:32:42.5809964Z self, 2025-05-07T20:32:42.5810037Z T: int, 2025-05-07T20:32:42.5810109Z D: int, 2025-05-07T20:32:42.5810208Z scale_ub: Optional[float], 2025-05-07T20:32:42.5810297Z contiguous: bool, 2025-05-07T20:32:42.5810377Z compiled: bool, 2025-05-07T20:32:42.5810450Z ) -> None: 2025-05-07T20:32:42.5810543Z torch.manual_seed(2025) 2025-05-07T20:32:42.5810611Z 2025-05-07T20:32:42.5810777Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5810914Z 2025-05-07T20:32:42.5811000Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5811121Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5811204Z x = x_sign * x_clamp 2025-05-07T20:32:42.5811360Z x0 = x[:, :D] 2025-05-07T20:32:42.5811438Z x1 = x[:, D:] 2025-05-07T20:32:42.5811504Z 2025-05-07T20:32:42.5811581Z if contiguous: 2025-05-07T20:32:42.5811671Z x0 = x0.contiguous() 2025-05-07T20:32:42.5811753Z x1 = x1.contiguous() 2025-05-07T20:32:42.5811817Z 2025-05-07T20:32:42.5811950Z if scale_ub is not None: 2025-05-07T20:32:42.5812049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5812183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5812256Z ) 2025-05-07T20:32:42.5812329Z else: 2025-05-07T20:32:42.5812419Z scale_ub_tensor = None 2025-05-07T20:32:42.5812487Z 2025-05-07T20:32:42.5812613Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5812701Z op = silu_mul_quant 2025-05-07T20:32:42.5812780Z if compiled: 2025-05-07T20:32:42.5812872Z op = torch.compile(op) 2025-05-07T20:32:42.5812981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5813048Z 2025-05-07T20:32:42.5813134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5813138Z 2025-05-07T20:32:42.5813235Z moe/activation_test.py:117: 2025-05-07T20:32:42.5813357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5813501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5813599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5813962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5814055Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5814545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5814640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5814998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5815218Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5815553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5815643Z kernel = self.compile( 2025-05-07T20:32:42.5816021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5816191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5816313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5816318Z 2025-05-07T20:32:42.5816524Z self = 2025-05-07T20:32:42.5817301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5817807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8040>} 2025-05-07T20:32:42.5818549Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5818737Z context = 2025-05-07T20:32:42.5818742Z 2025-05-07T20:32:42.5818905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5819205Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5819308Z module_map=module_map) 2025-05-07T20:32:42.5819467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5819597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5819677Z E ^ 2025-05-07T20:32:42.5820025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5820030Z 2025-05-07T20:32:42.5820441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5820484Z 2025-05-07T20:32:42.5820583Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5820798Z self=, 2025-05-07T20:32:42.5820870Z T=128, 2025-05-07T20:32:42.5820946Z D=7168, 2025-05-07T20:32:42.5821027Z scale_ub=1200.0, 2025-05-07T20:32:42.5821109Z contiguous=False, 2025-05-07T20:32:42.5821187Z compiled=False, 2025-05-07T20:32:42.5821254Z ) 2025-05-07T20:32:42.5821468Z self = 2025-05-07T20:32:42.5821640Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5821645Z 2025-05-07T20:32:42.5821714Z @given( 2025-05-07T20:32:42.5821831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5821924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5822077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5822195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5822304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5822376Z ) 2025-05-07T20:32:42.5822616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5822704Z def test_silu_mul_quant( 2025-05-07T20:32:42.5822783Z self, 2025-05-07T20:32:42.5822854Z T: int, 2025-05-07T20:32:42.5822923Z D: int, 2025-05-07T20:32:42.5823021Z scale_ub: Optional[float], 2025-05-07T20:32:42.5823103Z contiguous: bool, 2025-05-07T20:32:42.5823185Z compiled: bool, 2025-05-07T20:32:42.5823260Z ) -> None: 2025-05-07T20:32:42.5823350Z torch.manual_seed(2025) 2025-05-07T20:32:42.5823418Z 2025-05-07T20:32:42.5823583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5823650Z 2025-05-07T20:32:42.5823745Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5823863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5823946Z x = x_sign * x_clamp 2025-05-07T20:32:42.5824021Z x0 = x[:, :D] 2025-05-07T20:32:42.5824095Z x1 = x[:, D:] 2025-05-07T20:32:42.5824162Z 2025-05-07T20:32:42.5824243Z if contiguous: 2025-05-07T20:32:42.5824328Z x0 = x0.contiguous() 2025-05-07T20:32:42.5824414Z x1 = x1.contiguous() 2025-05-07T20:32:42.5824486Z 2025-05-07T20:32:42.5824571Z if scale_ub is not None: 2025-05-07T20:32:42.5824670Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5824803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5824876Z ) 2025-05-07T20:32:42.5824952Z else: 2025-05-07T20:32:42.5825043Z scale_ub_tensor = None 2025-05-07T20:32:42.5825112Z 2025-05-07T20:32:42.5825239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5825327Z op = silu_mul_quant 2025-05-07T20:32:42.5825409Z if compiled: 2025-05-07T20:32:42.5825506Z op = torch.compile(op) 2025-05-07T20:32:42.5825607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5825671Z 2025-05-07T20:32:42.5825759Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5825763Z 2025-05-07T20:32:42.5825905Z moe/activation_test.py:117: 2025-05-07T20:32:42.5826027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5826132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5826228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5826773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5826867Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5827218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5827485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5827817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5827907Z kernel = self.compile( 2025-05-07T20:32:42.5828282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5828454Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5828579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5828586Z 2025-05-07T20:32:42.5828788Z self = 2025-05-07T20:32:42.5829599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5830191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ee8ca0>} 2025-05-07T20:32:42.5830927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5831118Z context = 2025-05-07T20:32:42.5831122Z 2025-05-07T20:32:42.5831285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5831544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5831646Z module_map=module_map) 2025-05-07T20:32:42.5831803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5831906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5831979Z E ^ 2025-05-07T20:32:42.5832337Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5832342Z 2025-05-07T20:32:42.5832750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5832757Z 2025-05-07T20:32:42.5832853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5833076Z self=, 2025-05-07T20:32:42.5833147Z T=128, 2025-05-07T20:32:42.5833219Z D=5120, 2025-05-07T20:32:42.5833299Z scale_ub=None, 2025-05-07T20:32:42.5833379Z contiguous=False, 2025-05-07T20:32:42.5833460Z compiled=False, 2025-05-07T20:32:42.5833533Z ) 2025-05-07T20:32:42.5833751Z self = 2025-05-07T20:32:42.5833922Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5833927Z 2025-05-07T20:32:42.5834001Z @given( 2025-05-07T20:32:42.5834115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5834213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5834323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5834483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5834596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5834666Z ) 2025-05-07T20:32:42.5834907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5835040Z def test_silu_mul_quant( 2025-05-07T20:32:42.5835114Z self, 2025-05-07T20:32:42.5835190Z T: int, 2025-05-07T20:32:42.5835261Z D: int, 2025-05-07T20:32:42.5835355Z scale_ub: Optional[float], 2025-05-07T20:32:42.5835445Z contiguous: bool, 2025-05-07T20:32:42.5835566Z compiled: bool, 2025-05-07T20:32:42.5835638Z ) -> None: 2025-05-07T20:32:42.5835731Z torch.manual_seed(2025) 2025-05-07T20:32:42.5835799Z 2025-05-07T20:32:42.5835961Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5836031Z 2025-05-07T20:32:42.5836117Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5836247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5836351Z x = x_sign * x_clamp 2025-05-07T20:32:42.5836435Z x0 = x[:, :D] 2025-05-07T20:32:42.5836523Z x1 = x[:, D:] 2025-05-07T20:32:42.5836593Z 2025-05-07T20:32:42.5836670Z if contiguous: 2025-05-07T20:32:42.5836762Z x0 = x0.contiguous() 2025-05-07T20:32:42.5836846Z x1 = x1.contiguous() 2025-05-07T20:32:42.5836911Z 2025-05-07T20:32:42.5837000Z if scale_ub is not None: 2025-05-07T20:32:42.5837101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5837278Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5837355Z ) 2025-05-07T20:32:42.5837427Z else: 2025-05-07T20:32:42.5837516Z scale_ub_tensor = None 2025-05-07T20:32:42.5837583Z 2025-05-07T20:32:42.5837706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5837791Z op = silu_mul_quant 2025-05-07T20:32:42.5837877Z if compiled: 2025-05-07T20:32:42.5837971Z op = torch.compile(op) 2025-05-07T20:32:42.5838080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5838148Z 2025-05-07T20:32:42.5838232Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5838240Z 2025-05-07T20:32:42.5838335Z moe/activation_test.py:117: 2025-05-07T20:32:42.5838461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5838555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5838653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5839154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5839247Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5839604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5839821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5840158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5840247Z kernel = self.compile( 2025-05-07T20:32:42.5840623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5840796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5840917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5840924Z 2025-05-07T20:32:42.5841130Z self = 2025-05-07T20:32:42.5841901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5842476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc89fd310>} 2025-05-07T20:32:42.5843252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5843439Z context = 2025-05-07T20:32:42.5843444Z 2025-05-07T20:32:42.5843610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5843909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5844010Z module_map=module_map) 2025-05-07T20:32:42.5844170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5844263Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5844340Z E ^ 2025-05-07T20:32:42.5844697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:42.5845216Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

Same test body as above; the call into fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]) fails in Triton's compile pipeline (jit.py:623 run -> compiler.py:273 compile -> make_ir -> ast_to_ttir) with the identical error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
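The repeated ValueError is Triton's architecture check: fp8e4nv names the e4m3 FP8 encoding, which appears to require an SM 8.9+ GPU (Ada/Hopper), while the A10G behind linux.g5.4xlarge is SM 8.6 and only offers the fp8e4b15 and fp8e5 encodings. A minimal sketch of a capability guard that would skip these tests on such runners; supports_fp8e4nv is a hypothetical helper, not part of the test suite:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for the e4m3 FP8 format; hardware
        # support is assumed here to begin at compute capability 8.9
        # (Ada) -- the A10G (8.6) on this runner predates it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...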
2025-05-07T20:32:42.5863264Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

Same test body; because compiled=True, the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80, then fails with the identical CompilationError:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
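Hypothesis's verbose output above prints every drawn example, so any one of them can be replayed deterministically outside CI by pinning it with hypothesis.example before the randomized search. A minimal self-contained sketch; test_shapes is a stand-in for the real test, not part of the suite:

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # replay a failing draw from the log first
    @settings(max_examples=10, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # A real reproduction would call silu_mul_quant here; this
        # stand-in only demonstrates the pinning mechanism.
        assert T > 0 and D > 0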
2025-05-07T20:32:42.5876197Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Same failure, same traceback through eval_frame.py:678 and activation.py:80.

2025-05-07T20:32:42.5889011Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets further: the compiled silu_mul_quant call itself returns (y_fp8, y_scale = fn() succeeds), and the failure moves to the reference path:

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](

The launch goes through the Triton autotuner (autotuner.py:186 -> autotuner.py:166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 compile -> compiler.py:273 make_ir) and hits the same architecture check while compiling the quantization kernel:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
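For comparison, the quantity ref_fn checks can be sketched in plain PyTorch with no Triton kernel, which also avoids the fp8e4nv compile step entirely. A minimal sketch, assuming a PyTorch build with torch.float8_e4m3fn; the scale convention (per-row max over the FP8 maximum, optionally capped by scale_ub) is inferred from the test's dequantization step y_fp8.to(torch.float32) * y_scale[:, None], not taken from FBGEMM's kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1          # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)            # per-row dynamic range
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))
        scale = (row_max / FP8_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Under this convention the roundtrip y_fp8.to(torch.float32) * scale[:, None] approximates y to within e4m3 precision, which is what the test's comparison relies on.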
at 0x7f9fc8c44160>} 2025-05-07T20:32:42.5903074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5903261Z context = 2025-05-07T20:32:42.5903307Z 2025-05-07T20:32:42.5903468Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5903940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5904201Z module_map=module_map) 2025-05-07T20:32:42.5904366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5904466Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.5904540Z E ^ 2025-05-07T20:32:42.5904891Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5904959Z 2025-05-07T20:32:42.5905375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5905379Z 2025-05-07T20:32:42.5905477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5905698Z self=, 2025-05-07T20:32:42.5905776Z T=1, 2025-05-07T20:32:42.5905847Z D=5120, 2025-05-07T20:32:42.5905927Z scale_ub=1200.0, 2025-05-07T20:32:42.5906009Z contiguous=False, 2025-05-07T20:32:42.5906090Z compiled=True, 2025-05-07T20:32:42.5906162Z ) 2025-05-07T20:32:42.5906377Z self = 2025-05-07T20:32:42.5906538Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5906543Z 2025-05-07T20:32:42.5906616Z @given( 2025-05-07T20:32:42.5906789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5906891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5907000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5907111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5907223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5907293Z ) 2025-05-07T20:32:42.5907536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5907630Z def test_silu_mul_quant( 2025-05-07T20:32:42.5907702Z self, 2025-05-07T20:32:42.5907775Z T: int, 2025-05-07T20:32:42.5907850Z D: int, 2025-05-07T20:32:42.5907947Z scale_ub: Optional[float], 2025-05-07T20:32:42.5908030Z contiguous: bool, 2025-05-07T20:32:42.5908113Z compiled: bool, 2025-05-07T20:32:42.5908188Z ) -> None: 2025-05-07T20:32:42.5908278Z torch.manual_seed(2025) 2025-05-07T20:32:42.5908343Z 2025-05-07T20:32:42.5908510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5908580Z 2025-05-07T20:32:42.5908671Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5908791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5908878Z x = x_sign * x_clamp 2025-05-07T20:32:42.5908951Z x0 = x[:, :D] 2025-05-07T20:32:42.5909024Z x1 = x[:, D:] 2025-05-07T20:32:42.5909096Z 2025-05-07T20:32:42.5909172Z if contiguous: 2025-05-07T20:32:42.5909257Z x0 = x0.contiguous() 2025-05-07T20:32:42.5909348Z x1 = x1.contiguous() 2025-05-07T20:32:42.5909415Z 2025-05-07T20:32:42.5909508Z if scale_ub is not None: 2025-05-07T20:32:42.5909606Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5909735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5909877Z ) 2025-05-07T20:32:42.5909950Z else: 2025-05-07T20:32:42.5910039Z scale_ub_tensor = None 2025-05-07T20:32:42.5910115Z 2025-05-07T20:32:42.5910240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5910324Z op = silu_mul_quant 2025-05-07T20:32:42.5910408Z if compiled: 
2025-05-07T20:32:42.5910500Z op = torch.compile(op) 2025-05-07T20:32:42.5910600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5910738Z 2025-05-07T20:32:42.5910824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5910828Z 2025-05-07T20:32:42.5910923Z moe/activation_test.py:117: 2025-05-07T20:32:42.5911046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5911180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5911279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5911641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5911739Z return fn(*args, **kwargs) 2025-05-07T20:32:42.5912269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5912367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5912719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5912938Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5913275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5913364Z kernel = self.compile( 2025-05-07T20:32:42.5913738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5913912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5914034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5914041Z 2025-05-07T20:32:42.5914284Z self = 2025-05-07T20:32:42.5915059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5915569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8c44b80>} 2025-05-07T20:32:42.5916314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5916499Z context = 2025-05-07T20:32:42.5916504Z 2025-05-07T20:32:42.5916668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5916927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5917029Z module_map=module_map) 2025-05-07T20:32:42.5917189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5917284Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5917357Z E ^ 2025-05-07T20:32:42.5917714Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5917718Z 2025-05-07T20:32:42.5918128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5918133Z 2025-05-07T20:32:42.5918235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5918452Z self=, 2025-05-07T20:32:42.5918524Z T=1, 2025-05-07T20:32:42.5918598Z D=5120, 2025-05-07T20:32:42.5918677Z scale_ub=1200.0, 2025-05-07T20:32:42.5918761Z contiguous=False, 2025-05-07T20:32:42.5918839Z compiled=False, 2025-05-07T20:32:42.5918905Z ) 2025-05-07T20:32:42.5919122Z self = 2025-05-07T20:32:42.5919284Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5919330Z 2025-05-07T20:32:42.5919405Z @given( 2025-05-07T20:32:42.5919520Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5919612Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5919768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5919882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5919991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5920064Z ) 2025-05-07T20:32:42.5920308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5920441Z def test_silu_mul_quant( 2025-05-07T20:32:42.5920511Z self, 2025-05-07T20:32:42.5920582Z T: int, 2025-05-07T20:32:42.5920650Z D: int, 2025-05-07T20:32:42.5920748Z scale_ub: Optional[float], 2025-05-07T20:32:42.5920830Z contiguous: bool, 2025-05-07T20:32:42.5920911Z compiled: bool, 2025-05-07T20:32:42.5920986Z ) -> None: 2025-05-07T20:32:42.5921074Z torch.manual_seed(2025) 2025-05-07T20:32:42.5921145Z 2025-05-07T20:32:42.5921307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5921374Z 2025-05-07T20:32:42.5921469Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5921586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5921668Z x = x_sign * x_clamp 2025-05-07T20:32:42.5921746Z x0 = x[:, :D] 2025-05-07T20:32:42.5921820Z x1 = x[:, D:] 2025-05-07T20:32:42.5921886Z 2025-05-07T20:32:42.5921965Z if contiguous: 2025-05-07T20:32:42.5922095Z x0 = x0.contiguous() 2025-05-07T20:32:42.5922182Z x1 = x1.contiguous() 2025-05-07T20:32:42.5922250Z 2025-05-07T20:32:42.5922334Z if scale_ub is not None: 2025-05-07T20:32:42.5922436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5922569Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5922643Z ) 2025-05-07T20:32:42.5922719Z else: 2025-05-07T20:32:42.5922809Z scale_ub_tensor = None 2025-05-07T20:32:42.5922877Z 2025-05-07T20:32:42.5923004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5923091Z op = silu_mul_quant 2025-05-07T20:32:42.5923169Z if compiled: 2025-05-07T20:32:42.5923267Z op = torch.compile(op) 2025-05-07T20:32:42.5923366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5923430Z 2025-05-07T20:32:42.5923519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5923529Z 2025-05-07T20:32:42.5923621Z moe/activation_test.py:117: 2025-05-07T20:32:42.5923747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5923843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5923936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5924435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5924529Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5924879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5925104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5925435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5925531Z kernel = self.compile( 2025-05-07T20:32:42.5925911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5926080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5926202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5926207Z 2025-05-07T20:32:42.5926407Z self = 2025-05-07T20:32:42.5927282Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5927783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc9015550>} 2025-05-07T20:32:42.5928522Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5928776Z context = 2025-05-07T20:32:42.5928780Z 2025-05-07T20:32:42.5928941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5929202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5929303Z module_map=module_map) 2025-05-07T20:32:42.5929459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5929560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5929633Z E ^ 2025-05-07T20:32:42.5929981Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5929990Z 2025-05-07T20:32:42.5930438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5930446Z 2025-05-07T20:32:42.5930544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5930764Z self=, 2025-05-07T20:32:42.5930839Z T=16384, 2025-05-07T20:32:42.5930911Z D=5120, 2025-05-07T20:32:42.5930997Z scale_ub=1200.0, 2025-05-07T20:32:42.5931075Z contiguous=False, 2025-05-07T20:32:42.5931151Z compiled=True, 2025-05-07T20:32:42.5931225Z ) 2025-05-07T20:32:42.5931439Z self = 2025-05-07T20:32:42.5931616Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5931621Z 2025-05-07T20:32:42.5931692Z @given( 2025-05-07T20:32:42.5931805Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5931899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5932011Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5932128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5932309Z ) 2025-05-07T20:32:42.5932552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5932638Z def test_silu_mul_quant( 2025-05-07T20:32:42.5932711Z self, 2025-05-07T20:32:42.5932785Z T: int, 2025-05-07T20:32:42.5932856Z D: int, 2025-05-07T20:32:42.5932950Z scale_ub: Optional[float], 2025-05-07T20:32:42.5933038Z contiguous: bool, 2025-05-07T20:32:42.5933121Z compiled: bool, 2025-05-07T20:32:42.5933193Z ) -> None: 2025-05-07T20:32:42.5933286Z torch.manual_seed(2025) 2025-05-07T20:32:42.5933354Z 2025-05-07T20:32:42.5933516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5933589Z 2025-05-07T20:32:42.5933679Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5933800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5933886Z x = x_sign * x_clamp 2025-05-07T20:32:42.5933959Z x0 = x[:, :D] 2025-05-07T20:32:42.5934034Z x1 = x[:, D:] 2025-05-07T20:32:42.5934099Z 2025-05-07T20:32:42.5934175Z if contiguous: 2025-05-07T20:32:42.5934264Z x0 = x0.contiguous() 2025-05-07T20:32:42.5934396Z x1 = x1.contiguous() 2025-05-07T20:32:42.5934461Z 2025-05-07T20:32:42.5934553Z if scale_ub is not None: 2025-05-07T20:32:42.5934653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5934820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5934896Z ) 2025-05-07T20:32:42.5934966Z else: 2025-05-07T20:32:42.5935055Z scale_ub_tensor = None 2025-05-07T20:32:42.5935126Z 2025-05-07T20:32:42.5935249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5935379Z op = silu_mul_quant 2025-05-07T20:32:42.5935458Z if compiled: 2025-05-07T20:32:42.5935553Z op = torch.compile(op) 2025-05-07T20:32:42.5935655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5935720Z 2025-05-07T20:32:42.5935805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5935809Z 2025-05-07T20:32:42.5935908Z moe/activation_test.py:117: 2025-05-07T20:32:42.5936031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5936125Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5936223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5936586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5936678Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5937163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5937849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5938212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5938429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5938760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5938855Z kernel = self.compile( 2025-05-07T20:32:42.5939227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5939401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5939522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5939527Z 2025-05-07T20:32:42.5939728Z self = 2025-05-07T20:32:42.5940510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5941014Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8cee1f0>} 2025-05-07T20:32:42.5941761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5941947Z context = 2025-05-07T20:32:42.5941951Z 2025-05-07T20:32:42.5942113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5942375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5942481Z module_map=module_map) 2025-05-07T20:32:42.5942645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5942737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5942806Z E ^ 2025-05-07T20:32:42.5943157Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5943207Z 2025-05-07T20:32:42.5943616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5943621Z 2025-05-07T20:32:42.5943761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5943979Z self=, 2025-05-07T20:32:42.5944049Z T=2048, 2025-05-07T20:32:42.5944122Z D=7168, 2025-05-07T20:32:42.5944199Z scale_ub=1200.0, 2025-05-07T20:32:42.5944279Z contiguous=False, 2025-05-07T20:32:42.5944401Z compiled=True, 2025-05-07T20:32:42.5944468Z ) 2025-05-07T20:32:42.5944679Z self = 2025-05-07T20:32:42.5944849Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5944853Z 2025-05-07T20:32:42.5944926Z @given( 2025-05-07T20:32:42.5945042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5945137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5945246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5945360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5945471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5945539Z ) 2025-05-07T20:32:42.5945781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5945871Z def test_silu_mul_quant( 2025-05-07T20:32:42.5945944Z self, 2025-05-07T20:32:42.5946017Z T: int, 2025-05-07T20:32:42.5946125Z D: int, 2025-05-07T20:32:42.5946222Z scale_ub: Optional[float], 2025-05-07T20:32:42.5946305Z contiguous: bool, 2025-05-07T20:32:42.5946385Z compiled: bool, 2025-05-07T20:32:42.5946461Z ) -> None: 2025-05-07T20:32:42.5946549Z torch.manual_seed(2025) 2025-05-07T20:32:42.5946617Z 2025-05-07T20:32:42.5946786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5946853Z 2025-05-07T20:32:42.5946936Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5947059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5947143Z x = x_sign * x_clamp 2025-05-07T20:32:42.5947218Z x0 = x[:, :D] 2025-05-07T20:32:42.5947300Z x1 = x[:, D:] 2025-05-07T20:32:42.5947367Z 2025-05-07T20:32:42.5947451Z if contiguous: 2025-05-07T20:32:42.5947537Z x0 = x0.contiguous() 2025-05-07T20:32:42.5947623Z x1 = x1.contiguous() 2025-05-07T20:32:42.5947700Z 2025-05-07T20:32:42.5947786Z if scale_ub is not None: 2025-05-07T20:32:42.5947886Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5948020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5948094Z ) 2025-05-07T20:32:42.5948168Z else: 2025-05-07T20:32:42.5948260Z scale_ub_tensor = None 2025-05-07T20:32:42.5948326Z 2025-05-07T20:32:42.5948451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5948537Z op = silu_mul_quant 2025-05-07T20:32:42.5948616Z if compiled: 2025-05-07T20:32:42.5948712Z op = torch.compile(op) 2025-05-07T20:32:42.5948812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5948880Z 2025-05-07T20:32:42.5948969Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5948973Z 2025-05-07T20:32:42.5949065Z moe/activation_test.py:117: 2025-05-07T20:32:42.5949193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5949295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5949389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5949748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5949889Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5950422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5950516Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5950906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5951127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5951461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5951590Z kernel = self.compile( 2025-05-07T20:32:42.5951965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5952133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5952253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5952260Z 2025-05-07T20:32:42.5952460Z self = 2025-05-07T20:32:42.5953236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5953752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8ceeee0>} 2025-05-07T20:32:42.5954536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5954725Z context = 2025-05-07T20:32:42.5954729Z 2025-05-07T20:32:42.5954892Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5955153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5955257Z module_map=module_map) 2025-05-07T20:32:42.5955416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5955507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5955584Z E ^ 2025-05-07T20:32:42.5955933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5955941Z 2025-05-07T20:32:42.5956349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5956357Z 2025-05-07T20:32:42.5956455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5956672Z self=, 2025-05-07T20:32:42.5956746Z T=1, 2025-05-07T20:32:42.5956817Z D=5120, 2025-05-07T20:32:42.5956894Z scale_ub=None, 2025-05-07T20:32:42.5956978Z contiguous=False, 2025-05-07T20:32:42.5957058Z compiled=False, 2025-05-07T20:32:42.5957127Z ) 2025-05-07T20:32:42.5957347Z self = 2025-05-07T20:32:42.5957508Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5957512Z 2025-05-07T20:32:42.5957586Z @given( 2025-05-07T20:32:42.5957697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5957793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5957913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5958025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5958133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5958205Z ) 2025-05-07T20:32:42.5958446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5958580Z def test_silu_mul_quant( 2025-05-07T20:32:42.5958657Z self, 2025-05-07T20:32:42.5958730Z T: int, 2025-05-07T20:32:42.5958800Z D: int, 2025-05-07T20:32:42.5958900Z scale_ub: Optional[float], 2025-05-07T20:32:42.5959044Z contiguous: bool, 2025-05-07T20:32:42.5959130Z compiled: bool, 2025-05-07T20:32:42.5959201Z ) -> None: 2025-05-07T20:32:42.5959291Z torch.manual_seed(2025) 2025-05-07T20:32:42.5959358Z 2025-05-07T20:32:42.5959524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5959632Z 2025-05-07T20:32:42.5959722Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5959840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5959925Z x = x_sign * x_clamp 2025-05-07T20:32:42.5960004Z x0 = x[:, :D] 2025-05-07T20:32:42.5960078Z x1 = x[:, D:] 2025-05-07T20:32:42.5960145Z 2025-05-07T20:32:42.5960227Z if contiguous: 2025-05-07T20:32:42.5960317Z x0 = x0.contiguous() 2025-05-07T20:32:42.5960406Z x1 = x1.contiguous() 2025-05-07T20:32:42.5960472Z 2025-05-07T20:32:42.5960559Z if scale_ub is not None: 2025-05-07T20:32:42.5960664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5960791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5960861Z ) 2025-05-07T20:32:42.5960940Z else: 2025-05-07T20:32:42.5961028Z scale_ub_tensor = None 2025-05-07T20:32:42.5961096Z 2025-05-07T20:32:42.5961266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5961355Z op = silu_mul_quant 2025-05-07T20:32:42.5961435Z if compiled: 2025-05-07T20:32:42.5961533Z op = torch.compile(op) 2025-05-07T20:32:42.5961634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5961703Z 2025-05-07T20:32:42.5961789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5961797Z 2025-05-07T20:32:42.5961888Z moe/activation_test.py:117: 2025-05-07T20:32:42.5962013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5962112Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5962205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5962702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5962793Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5963149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5963371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5963707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5963797Z kernel = self.compile( 2025-05-07T20:32:42.5964174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5964343Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5964469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5964473Z 2025-05-07T20:32:42.5964674Z self = 2025-05-07T20:32:42.5965450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5965990Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc8d595e0>} 2025-05-07T20:32:42.5966747Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5966984Z context = 2025-05-07T20:32:42.5966988Z 2025-05-07T20:32:42.5967186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5967448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5967550Z module_map=module_map) 2025-05-07T20:32:42.5967708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5967845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5967918Z E ^ 2025-05-07T20:32:42.5968280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5968284Z 2025-05-07T20:32:42.5968692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5968698Z 2025-05-07T20:32:42.5968796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5969016Z self=, 2025-05-07T20:32:42.5969090Z T=4096, 2025-05-07T20:32:42.5969169Z D=7168, 2025-05-07T20:32:42.5969247Z scale_ub=1200.0, 2025-05-07T20:32:42.5969329Z contiguous=False, 2025-05-07T20:32:42.5969411Z compiled=False, 2025-05-07T20:32:42.5969478Z ) 2025-05-07T20:32:42.5969734Z self = 2025-05-07T20:32:42.5969910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5969915Z 2025-05-07T20:32:42.5969987Z @given( 2025-05-07T20:32:42.5970102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5970200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5970309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5970428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5970539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5970609Z ) 2025-05-07T20:32:42.5970854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5970942Z def test_silu_mul_quant( 2025-05-07T20:32:42.5971013Z self, 2025-05-07T20:32:42.5971084Z T: int, 2025-05-07T20:32:42.5971156Z D: int, 2025-05-07T20:32:42.5971248Z scale_ub: Optional[float], 2025-05-07T20:32:42.5971340Z contiguous: bool, 2025-05-07T20:32:42.5971418Z compiled: bool, 2025-05-07T20:32:42.5971491Z ) -> None: 2025-05-07T20:32:42.5971584Z torch.manual_seed(2025) 2025-05-07T20:32:42.5971653Z 2025-05-07T20:32:42.5971815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5971885Z 2025-05-07T20:32:42.5971971Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5972096Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5972182Z x = x_sign * x_clamp 2025-05-07T20:32:42.5972259Z x0 = x[:, :D] 2025-05-07T20:32:42.5972340Z x1 = x[:, D:] 2025-05-07T20:32:42.5972405Z 2025-05-07T20:32:42.5972483Z if contiguous: 2025-05-07T20:32:42.5972573Z x0 = x0.contiguous() 2025-05-07T20:32:42.5972657Z x1 = x1.contiguous() 2025-05-07T20:32:42.5973083Z 2025-05-07T20:32:42.5973177Z if scale_ub is not None: 2025-05-07T20:32:42.5973276Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5973415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5973487Z ) 2025-05-07T20:32:42.5973559Z else: 2025-05-07T20:32:42.5973649Z scale_ub_tensor = None 2025-05-07T20:32:42.5973713Z 2025-05-07T20:32:42.5973836Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5973974Z op = silu_mul_quant 2025-05-07T20:32:42.5974053Z if compiled: 2025-05-07T20:32:42.5974145Z op = torch.compile(op) 2025-05-07T20:32:42.5979845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5979926Z 2025-05-07T20:32:42.5980097Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5980103Z 2025-05-07T20:32:42.5980203Z moe/activation_test.py:117: 2025-05-07T20:32:42.5980333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5980431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5980577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5981079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5981178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5981531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5981753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5982093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5982187Z kernel = self.compile( 2025-05-07T20:32:42.5982565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5982740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5982912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5982920Z 2025-05-07T20:32:42.5983128Z self = 2025-05-07T20:32:42.5983905Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5984412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc86f51f0>} 2025-05-07T20:32:42.5985156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5985341Z context = 2025-05-07T20:32:42.5985348Z 2025-05-07T20:32:42.5985512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5985770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5985876Z module_map=module_map) 2025-05-07T20:32:42.5986034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5986130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5986206Z E ^ 2025-05-07T20:32:42.5986555Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis keeps drawing further examples, and every one fails with the identical CompilationError. The test body and traceback match the failure shown above (for compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn), so only the drawn parameters are listed here:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
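For debugging outside Hypothesis, the failing call reduces to a single example. The module path and the op(x0, x1, scale_ub_tensor) call shape are taken from the traceback above; the exact import spelling is an assumption, and the sizes are one of the drawn (T, D) combinations:

```python
import torch

# Import path as shown in the traceback; the public import may differ.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120  # one of the sampled (T, D) combinations from the test
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without e4m3 support this raises the same CompilationError:
# ValueError("type fp8e4nv not supported in this architecture. ...")
y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub_tensor=None case
```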
2025-05-07T20:32:42.6131965Z Hypothesis then tried the following examples; every one failed with a traceback identical to the first (for compiled=True the call additionally passes through /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn before reaching activation.py:80) and ended in the same CompilationError. The last example of this span is shown in full below.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> same CompilationError
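For scale, the @given strategies above sample from a fixed grid, which is why so many distinct examples hit the same compile error. A quick sketch (plain Python, values copied from the decorator; not part of the test suite):

    from itertools import product

    T_vals = [1, 128, 2048, 4096, 16384]
    D_vals = [5120, 7168]
    scale_ub_vals = [None, 1200.00]
    bools = [True, False]

    # Every (T, D, scale_ub, contiguous, compiled) combination Hypothesis may draw:
    grid = list(product(T_vals, D_vals, scale_ub_vals, bools, bools))
    print(len(grid))  # 80 combinations; _MAX_SAMPLES (value not shown in this log) caps the attempts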
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6264334Z 2025-05-07T20:32:42.6264746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6264751Z 2025-05-07T20:32:42.6264849Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6265073Z self=, 2025-05-07T20:32:42.6265147Z T=128, 2025-05-07T20:32:42.6265215Z D=7168, 2025-05-07T20:32:42.6265296Z scale_ub=1200.0, 2025-05-07T20:32:42.6265378Z contiguous=False, 2025-05-07T20:32:42.6265456Z compiled=True, 2025-05-07T20:32:42.6265528Z ) 2025-05-07T20:32:42.6265745Z self = 2025-05-07T20:32:42.6265934Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.6265940Z 2025-05-07T20:32:42.6266015Z @given( 2025-05-07T20:32:42.6266152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6266248Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6266402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6266512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6266623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6266690Z ) 2025-05-07T20:32:42.6266969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6267064Z def test_silu_mul_quant( 2025-05-07T20:32:42.6267133Z self, 2025-05-07T20:32:42.6267207Z T: int, 2025-05-07T20:32:42.6267280Z D: int, 2025-05-07T20:32:42.6267375Z scale_ub: Optional[float], 2025-05-07T20:32:42.6267500Z contiguous: bool, 2025-05-07T20:32:42.6267581Z compiled: bool, 2025-05-07T20:32:42.6267654Z ) -> None: 2025-05-07T20:32:42.6267749Z torch.manual_seed(2025) 2025-05-07T20:32:42.6267816Z 2025-05-07T20:32:42.6267977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6268054Z 2025-05-07T20:32:42.6268139Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6268258Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6268344Z x = x_sign * x_clamp 2025-05-07T20:32:42.6268417Z x0 = x[:, :D] 2025-05-07T20:32:42.6268496Z x1 = x[:, D:] 2025-05-07T20:32:42.6268565Z 2025-05-07T20:32:42.6268643Z if contiguous: 2025-05-07T20:32:42.6268732Z x0 = x0.contiguous() 2025-05-07T20:32:42.6268816Z x1 = x1.contiguous() 2025-05-07T20:32:42.6268885Z 2025-05-07T20:32:42.6268972Z if scale_ub is not None: 2025-05-07T20:32:42.6269117Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6269250Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6269324Z ) 2025-05-07T20:32:42.6269398Z else: 2025-05-07T20:32:42.6269489Z scale_ub_tensor = None 2025-05-07T20:32:42.6269556Z 2025-05-07T20:32:42.6269681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6269767Z op = silu_mul_quant 2025-05-07T20:32:42.6269900Z if compiled: 2025-05-07T20:32:42.6269996Z op = torch.compile(op) 2025-05-07T20:32:42.6270099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6270173Z 2025-05-07T20:32:42.6270260Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6270265Z 2025-05-07T20:32:42.6270359Z moe/activation_test.py:117: 2025-05-07T20:32:42.6270481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6270575Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6270677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6271038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6271124Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6271612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6271706Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6272060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6272280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6272613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6272702Z kernel = self.compile( 2025-05-07T20:32:42.6273081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6273257Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6273377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6273382Z 2025-05-07T20:32:42.6273583Z self = 2025-05-07T20:32:42.6274408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6274943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc806a940>} 2025-05-07T20:32:42.6275686Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6275914Z context = 2025-05-07T20:32:42.6275919Z 2025-05-07T20:32:42.6276078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6276339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6276447Z module_map=module_map) 2025-05-07T20:32:42.6276607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6276702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6276774Z E ^ 2025-05-07T20:32:42.6277123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6277128Z 2025-05-07T20:32:42.6277575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6277584Z 2025-05-07T20:32:42.6277684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6277900Z self=, 2025-05-07T20:32:42.6277970Z T=2048, 2025-05-07T20:32:42.6278044Z D=7168, 2025-05-07T20:32:42.6278121Z scale_ub=None, 2025-05-07T20:32:42.6278204Z contiguous=True, 2025-05-07T20:32:42.6278286Z compiled=True, 2025-05-07T20:32:42.6278355Z ) 2025-05-07T20:32:42.6278566Z self = 2025-05-07T20:32:42.6278744Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6278749Z 2025-05-07T20:32:42.6278828Z @given( 2025-05-07T20:32:42.6278942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6279036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6279154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6279272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6279381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6279450Z ) 2025-05-07T20:32:42.6279693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6279781Z def test_silu_mul_quant( 2025-05-07T20:32:42.6279861Z self, 2025-05-07T20:32:42.6279933Z T: int, 2025-05-07T20:32:42.6280005Z D: int, 2025-05-07T20:32:42.6280102Z scale_ub: Optional[float], 2025-05-07T20:32:42.6280184Z contiguous: bool, 2025-05-07T20:32:42.6280267Z compiled: bool, 2025-05-07T20:32:42.6280340Z ) -> None: 2025-05-07T20:32:42.6280429Z torch.manual_seed(2025) 2025-05-07T20:32:42.6280497Z 2025-05-07T20:32:42.6280661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6280731Z 2025-05-07T20:32:42.6280818Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6280944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6281028Z x = x_sign * x_clamp 2025-05-07T20:32:42.6281105Z x0 = x[:, :D] 2025-05-07T20:32:42.6281182Z x1 = x[:, D:] 2025-05-07T20:32:42.6281249Z 2025-05-07T20:32:42.6281330Z if contiguous: 2025-05-07T20:32:42.6281414Z x0 = x0.contiguous() 2025-05-07T20:32:42.6281546Z x1 = x1.contiguous() 2025-05-07T20:32:42.6281612Z 2025-05-07T20:32:42.6281697Z if scale_ub is not None: 2025-05-07T20:32:42.6281800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6281928Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6282062Z ) 2025-05-07T20:32:42.6282139Z else: 2025-05-07T20:32:42.6282228Z scale_ub_tensor = None 2025-05-07T20:32:42.6282297Z 2025-05-07T20:32:42.6282424Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6282508Z op = silu_mul_quant 2025-05-07T20:32:42.6282630Z if compiled: 2025-05-07T20:32:42.6282729Z op = torch.compile(op) 2025-05-07T20:32:42.6282830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6282897Z 2025-05-07T20:32:42.6282983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6282987Z 2025-05-07T20:32:42.6283078Z moe/activation_test.py:117: 2025-05-07T20:32:42.6283207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6283304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6283397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6283763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6283849Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6284341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6284474Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6284831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6285056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6285389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6285480Z kernel = self.compile( 2025-05-07T20:32:42.6285857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6286030Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6286154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6286158Z 2025-05-07T20:32:42.6286361Z self = 2025-05-07T20:32:42.6287188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6287697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7fc9550>} 2025-05-07T20:32:42.6288441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6288632Z context = 2025-05-07T20:32:42.6288637Z 2025-05-07T20:32:42.6288798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6289056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6289166Z module_map=module_map) 2025-05-07T20:32:42.6289326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6289423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6289491Z E ^ 2025-05-07T20:32:42.6289840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6289888Z 2025-05-07T20:32:42.6290300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6290304Z 2025-05-07T20:32:42.6290400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6290656Z self=, 2025-05-07T20:32:42.6290729Z T=16384, 2025-05-07T20:32:42.6290798Z D=5120, 2025-05-07T20:32:42.6290879Z scale_ub=None, 2025-05-07T20:32:42.6290962Z contiguous=False, 2025-05-07T20:32:42.6291041Z compiled=False, 2025-05-07T20:32:42.6291155Z ) 2025-05-07T20:32:42.6291373Z self = 2025-05-07T20:32:42.6291543Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6291547Z 2025-05-07T20:32:42.6291622Z @given( 2025-05-07T20:32:42.6291735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6291835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6291946Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6292059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6292172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6292244Z ) 2025-05-07T20:32:42.6292485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6292575Z def test_silu_mul_quant( 2025-05-07T20:32:42.6292644Z self, 2025-05-07T20:32:42.6292715Z T: int, 2025-05-07T20:32:42.6292791Z D: int, 2025-05-07T20:32:42.6292924Z scale_ub: Optional[float], 2025-05-07T20:32:42.6293009Z contiguous: bool, 2025-05-07T20:32:42.6293092Z compiled: bool, 2025-05-07T20:32:42.6293164Z ) -> None: 2025-05-07T20:32:42.6293257Z torch.manual_seed(2025) 2025-05-07T20:32:42.6293325Z 2025-05-07T20:32:42.6293489Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6293566Z 2025-05-07T20:32:42.6293651Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6293768Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6295641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
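Two failure modes alternate through this run, and both are visible above. Every CompilationError has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which this runner's GPU does not implement; the error text says only fp8e4b15 and fp8e5 are available here. Below is a minimal sketch of a capability guard that skips the FP8 tests up front instead of failing in the Triton compiler; the (8, 9) compute-capability threshold, the helper name, and the class name are assumptions, not taken from this log.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # test_silu_mul_quant as shown in the log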
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6295650Z 2025-05-07T20:32:42.6295764Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6295768Z 2025-05-07T20:32:42.6295879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6296136Z self=, 2025-05-07T20:32:42.6296217Z T=4096, 2025-05-07T20:32:42.6296287Z D=7168, 2025-05-07T20:32:42.6296363Z scale_ub=1200.0, 2025-05-07T20:32:42.6296445Z contiguous=True, 2025-05-07T20:32:42.6296525Z compiled=True, 2025-05-07T20:32:42.6296593Z ) 2025-05-07T20:32:42.6296805Z self = 2025-05-07T20:32:42.6296969Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6296974Z 2025-05-07T20:32:42.6297048Z @given( 2025-05-07T20:32:42.6297165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6297259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6297367Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6297481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6297589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6297707Z ) 2025-05-07T20:32:42.6297954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6298043Z def test_silu_mul_quant( 2025-05-07T20:32:42.6298122Z self, 2025-05-07T20:32:42.6298230Z T: int, 2025-05-07T20:32:42.6298305Z D: int, 2025-05-07T20:32:42.6298399Z scale_ub: Optional[float], 2025-05-07T20:32:42.6298483Z contiguous: bool, 2025-05-07T20:32:42.6298561Z compiled: bool, 2025-05-07T20:32:42.6298637Z ) -> None: 2025-05-07T20:32:42.6298725Z torch.manual_seed(2025) 2025-05-07T20:32:42.6298835Z 2025-05-07T20:32:42.6298999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6299065Z 2025-05-07T20:32:42.6299154Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6299272Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6301079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6301091Z 2025-05-07T20:32:42.6301202Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6301247Z 2025-05-07T20:32:42.6301345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6301570Z self=, 2025-05-07T20:32:42.6301644Z T=16384, 2025-05-07T20:32:42.6301713Z D=7168, 2025-05-07T20:32:42.6301790Z scale_ub=None, 2025-05-07T20:32:42.6301872Z contiguous=False, 2025-05-07T20:32:42.6301952Z compiled=False, 2025-05-07T20:32:42.6302026Z ) 2025-05-07T20:32:42.6302234Z self = 2025-05-07T20:32:42.6302405Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6302413Z 2025-05-07T20:32:42.6302486Z @given( 2025-05-07T20:32:42.6302596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6302693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6302803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6302919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6303035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6303105Z ) 2025-05-07T20:32:42.6303349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6303437Z def test_silu_mul_quant( 2025-05-07T20:32:42.6303511Z self, 2025-05-07T20:32:42.6303591Z T: int, 2025-05-07T20:32:42.6303662Z D: int, 2025-05-07T20:32:42.6304051Z scale_ub: Optional[float], 2025-05-07T20:32:42.6304166Z contiguous: bool, 2025-05-07T20:32:42.6304249Z compiled: bool, 2025-05-07T20:32:42.6304320Z ) -> None: 2025-05-07T20:32:42.6304414Z torch.manual_seed(2025) 2025-05-07T20:32:42.6304482Z 2025-05-07T20:32:42.6304642Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6306485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
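The sizes in these OOM reports line up with the failing statement: x = torch.randn([T, 2 * D], ...) in bfloat16 costs T * 2D * 2 bytes. A quick check against the example above (T=16384, D=7168):

    # 16384 rows * (2 * 7168) columns * 2 bytes (bfloat16) = 448 MiB exactly.
    T, D, bytes_per_elem = 16384, 7168, 2
    size_mib = T * (2 * D) * bytes_per_elem / 2**20
    assert size_mib == 448.0  # matches "Tried to allocate 448.00 MiB"

The individual tensors are therefore modest; each example fails because the process is already holding roughly 22 GiB of the card's 22.07 GiB when it runs.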
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6306583Z 2025-05-07T20:32:42.6306700Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6306704Z 2025-05-07T20:32:42.6306807Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6307085Z self=, 2025-05-07T20:32:42.6307166Z T=2048, 2025-05-07T20:32:42.6307235Z D=7168, 2025-05-07T20:32:42.6307310Z scale_ub=1200.0, 2025-05-07T20:32:42.6307391Z contiguous=True, 2025-05-07T20:32:42.6307466Z compiled=True, 2025-05-07T20:32:42.6307535Z ) 2025-05-07T20:32:42.6307819Z self = 2025-05-07T20:32:42.6307984Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6307989Z 2025-05-07T20:32:42.6308058Z @given( 2025-05-07T20:32:42.6308172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6308265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6308379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6308489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6308595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6308667Z ) 2025-05-07T20:32:42.6308905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6308993Z def test_silu_mul_quant( 2025-05-07T20:32:42.6309070Z self, 2025-05-07T20:32:42.6309145Z T: int, 2025-05-07T20:32:42.6309213Z D: int, 2025-05-07T20:32:42.6309370Z scale_ub: Optional[float], 2025-05-07T20:32:42.6309455Z contiguous: bool, 2025-05-07T20:32:42.6309536Z compiled: bool, 2025-05-07T20:32:42.6309611Z ) -> None: 2025-05-07T20:32:42.6309698Z torch.manual_seed(2025) 2025-05-07T20:32:42.6309769Z 2025-05-07T20:32:42.6309981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6310051Z 2025-05-07T20:32:42.6310140Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6310258Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6312025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6312039Z 2025-05-07T20:32:42.6312151Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6312156Z 2025-05-07T20:32:42.6312254Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6312478Z self=, 2025-05-07T20:32:42.6312550Z T=2048, 2025-05-07T20:32:42.6312623Z D=7168, 2025-05-07T20:32:42.6312701Z scale_ub=None, 2025-05-07T20:32:42.6312779Z contiguous=True, 2025-05-07T20:32:42.6312858Z compiled=False, 2025-05-07T20:32:42.6312928Z ) 2025-05-07T20:32:42.6313135Z self = 2025-05-07T20:32:42.6313305Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6313310Z 2025-05-07T20:32:42.6313379Z @given( 2025-05-07T20:32:42.6313494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6313595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6313702Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6313813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6313925Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6314065Z ) 2025-05-07T20:32:42.6314309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6314399Z def test_silu_mul_quant( 2025-05-07T20:32:42.6314469Z self, 2025-05-07T20:32:42.6314548Z T: int, 2025-05-07T20:32:42.6314661Z D: int, 2025-05-07T20:32:42.6314757Z scale_ub: Optional[float], 2025-05-07T20:32:42.6314844Z contiguous: bool, 2025-05-07T20:32:42.6314923Z compiled: bool, 2025-05-07T20:32:42.6314995Z ) -> None: 2025-05-07T20:32:42.6315092Z torch.manual_seed(2025) 2025-05-07T20:32:42.6315160Z 2025-05-07T20:32:42.6315365Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6315438Z 2025-05-07T20:32:42.6315526Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.6317297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6317305Z 2025-05-07T20:32:42.6317419Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.6317423Z 2025-05-07T20:32:42.6317528Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6317791Z self=, 2025-05-07T20:32:42.6317866Z T=1, 2025-05-07T20:32:42.6317935Z D=7168, 2025-05-07T20:32:42.6318012Z scale_ub=1200.0, 2025-05-07T20:32:42.6318091Z contiguous=True, 2025-05-07T20:32:42.6318171Z compiled=False, 2025-05-07T20:32:42.6318238Z ) 2025-05-07T20:32:42.6318447Z self = 2025-05-07T20:32:42.6318613Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6318618Z 2025-05-07T20:32:42.6318691Z @given( 2025-05-07T20:32:42.6318809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6318901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6319010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6319124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6319231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6319302Z ) 2025-05-07T20:32:42.6319553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6319644Z def test_silu_mul_quant( 2025-05-07T20:32:42.6319716Z self, 2025-05-07T20:32:42.6319793Z T: int, 2025-05-07T20:32:42.6319861Z D: int, 2025-05-07T20:32:42.6319955Z scale_ub: Optional[float], 2025-05-07T20:32:42.6320040Z contiguous: bool, 2025-05-07T20:32:42.6320119Z compiled: bool, 2025-05-07T20:32:42.6320197Z ) -> None: 2025-05-07T20:32:42.6320287Z torch.manual_seed(2025) 2025-05-07T20:32:42.6320356Z 2025-05-07T20:32:42.6320523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6320591Z 2025-05-07T20:32:42.6320676Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6320798Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6320881Z x = x_sign * x_clamp 2025-05-07T20:32:42.6320954Z x0 = x[:, :D] 2025-05-07T20:32:42.6321039Z x1 = x[:, D:] 2025-05-07T20:32:42.6321106Z 2025-05-07T20:32:42.6321188Z if contiguous: 2025-05-07T20:32:42.6321274Z x0 = x0.contiguous() 2025-05-07T20:32:42.6321356Z x1 = x1.contiguous() 2025-05-07T20:32:42.6321422Z 2025-05-07T20:32:42.6321508Z if scale_ub is not None: 2025-05-07T20:32:42.6321609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6321790Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6321860Z ) 2025-05-07T20:32:42.6321934Z else: 2025-05-07T20:32:42.6322027Z scale_ub_tensor = None 2025-05-07T20:32:42.6322133Z 2025-05-07T20:32:42.6322261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6322350Z op = silu_mul_quant 2025-05-07T20:32:42.6322431Z if compiled: 2025-05-07T20:32:42.6322529Z op = torch.compile(op) 2025-05-07T20:32:42.6322635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6322744Z 2025-05-07T20:32:42.6322834Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6322838Z 2025-05-07T20:32:42.6322928Z moe/activation_test.py:117: 2025-05-07T20:32:42.6323051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6323152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6323249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6323747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6323840Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6324195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6324419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6324792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6324884Z kernel = self.compile( 2025-05-07T20:32:42.6325264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6325433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6325557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6325566Z 2025-05-07T20:32:42.6325769Z self = 2025-05-07T20:32:42.6326549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6327053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc818f040>} 2025-05-07T20:32:42.6327795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6327983Z context = 2025-05-07T20:32:42.6327990Z 2025-05-07T20:32:42.6328149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6328408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6328515Z module_map=module_map) 2025-05-07T20:32:42.6328674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6328771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6328842Z E ^ 2025-05-07T20:32:42.6329197Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6329204Z 2025-05-07T20:32:42.6329611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6329616Z 2025-05-07T20:32:42.6329712Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6329932Z self=, 2025-05-07T20:32:42.6330046Z T=128, 2025-05-07T20:32:42.6330116Z D=5120, 2025-05-07T20:32:42.6330196Z scale_ub=None, 2025-05-07T20:32:42.6330274Z contiguous=True, 2025-05-07T20:32:42.6330351Z compiled=False, 2025-05-07T20:32:42.6330422Z ) 2025-05-07T20:32:42.6330678Z self = 2025-05-07T20:32:42.6330846Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6330850Z 2025-05-07T20:32:42.6330925Z @given( 2025-05-07T20:32:42.6331037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6331176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6331289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6331400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6331509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6331576Z ) 2025-05-07T20:32:42.6331823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6331917Z def test_silu_mul_quant( 2025-05-07T20:32:42.6331984Z self, 2025-05-07T20:32:42.6332055Z T: int, 2025-05-07T20:32:42.6332127Z D: int, 2025-05-07T20:32:42.6332223Z scale_ub: Optional[float], 2025-05-07T20:32:42.6332305Z contiguous: bool, 2025-05-07T20:32:42.6332389Z compiled: bool, 2025-05-07T20:32:42.6332460Z ) -> None: 2025-05-07T20:32:42.6332549Z torch.manual_seed(2025) 2025-05-07T20:32:42.6332617Z 2025-05-07T20:32:42.6332819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6332894Z 2025-05-07T20:32:42.6332980Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6333099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6333183Z x = x_sign * x_clamp 2025-05-07T20:32:42.6333258Z x0 = x[:, :D] 2025-05-07T20:32:42.6333333Z x1 = x[:, D:] 2025-05-07T20:32:42.6333402Z 2025-05-07T20:32:42.6333480Z if contiguous: 2025-05-07T20:32:42.6333563Z x0 = x0.contiguous() 2025-05-07T20:32:42.6333648Z x1 = x1.contiguous() 2025-05-07T20:32:42.6333715Z 2025-05-07T20:32:42.6333799Z if scale_ub is not None: 2025-05-07T20:32:42.6333908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6334037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6334114Z ) 2025-05-07T20:32:42.6334187Z else: 2025-05-07T20:32:42.6334276Z scale_ub_tensor = None 2025-05-07T20:32:42.6334352Z 2025-05-07T20:32:42.6334475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6334561Z op = silu_mul_quant 2025-05-07T20:32:42.6334644Z if compiled: 2025-05-07T20:32:42.6334738Z op = torch.compile(op) 2025-05-07T20:32:42.6334839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6334912Z 2025-05-07T20:32:42.6334998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6335003Z 2025-05-07T20:32:42.6335097Z moe/activation_test.py:117: 2025-05-07T20:32:42.6335225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6335323Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6335419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6335920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6336014Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6336375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6336591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6336930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6337064Z kernel = self.compile( 2025-05-07T20:32:42.6337438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6337610Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6337771Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6337776Z 2025-05-07T20:32:42.6337980Z self = 2025-05-07T20:32:42.6338759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6339302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc818fa60>} 2025-05-07T20:32:42.6340046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6340235Z context = 2025-05-07T20:32:42.6340240Z 2025-05-07T20:32:42.6340404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6340661Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6340821Z module_map=module_map) 2025-05-07T20:32:42.6340982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6341077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6341150Z E ^ 2025-05-07T20:32:42.6341510Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6341519Z 2025-05-07T20:32:42.6341925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6341930Z 2025-05-07T20:32:42.6342027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6342248Z self=, 2025-05-07T20:32:42.6342320Z T=128, 2025-05-07T20:32:42.6342390Z D=7168, 2025-05-07T20:32:42.6342463Z scale_ub=None, 2025-05-07T20:32:42.6342541Z contiguous=True, 2025-05-07T20:32:42.6342623Z compiled=False, 2025-05-07T20:32:42.6348162Z ) 2025-05-07T20:32:42.6348417Z self = 2025-05-07T20:32:42.6348593Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6348599Z 2025-05-07T20:32:42.6348673Z @given( 2025-05-07T20:32:42.6348789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6348884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6348997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6349110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6349219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6349290Z ) 2025-05-07T20:32:42.6349539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6349629Z def test_silu_mul_quant( 2025-05-07T20:32:42.6349701Z self, 2025-05-07T20:32:42.6349775Z T: int, 2025-05-07T20:32:42.6349901Z D: int, 2025-05-07T20:32:42.6350003Z scale_ub: Optional[float], 2025-05-07T20:32:42.6350092Z contiguous: bool, 2025-05-07T20:32:42.6350171Z compiled: bool, 2025-05-07T20:32:42.6350242Z ) -> None: 2025-05-07T20:32:42.6350336Z torch.manual_seed(2025) 2025-05-07T20:32:42.6350404Z 2025-05-07T20:32:42.6350571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6350711Z 2025-05-07T20:32:42.6350798Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6350922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6351005Z x = x_sign * x_clamp 2025-05-07T20:32:42.6351079Z x0 = x[:, :D] 2025-05-07T20:32:42.6351196Z x1 = x[:, D:] 2025-05-07T20:32:42.6351261Z 2025-05-07T20:32:42.6351343Z if contiguous: 2025-05-07T20:32:42.6351434Z x0 = x0.contiguous() 2025-05-07T20:32:42.6351522Z x1 = x1.contiguous() 2025-05-07T20:32:42.6351589Z 2025-05-07T20:32:42.6351680Z if scale_ub is not None: 2025-05-07T20:32:42.6351825Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6351964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6352038Z ) 2025-05-07T20:32:42.6352112Z else: 2025-05-07T20:32:42.6352204Z scale_ub_tensor = None 2025-05-07T20:32:42.6352268Z 2025-05-07T20:32:42.6352394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6352486Z op = silu_mul_quant 2025-05-07T20:32:42.6352567Z if compiled: 2025-05-07T20:32:42.6352663Z op = torch.compile(op) 2025-05-07T20:32:42.6352770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6352838Z 2025-05-07T20:32:42.6352924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6352929Z 2025-05-07T20:32:42.6353024Z moe/activation_test.py:117: 2025-05-07T20:32:42.6353150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6353292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6353390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6353894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6353991Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6354349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6354573Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6354915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6355004Z kernel = self.compile( 2025-05-07T20:32:42.6355383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6355553Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6355682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6355687Z 2025-05-07T20:32:42.6355892Z self = 2025-05-07T20:32:42.6356672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6357188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7d85790>} 2025-05-07T20:32:42.6357931Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6358123Z context = 2025-05-07T20:32:42.6358133Z 2025-05-07T20:32:42.6358295Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6358554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6358659Z module_map=module_map) 2025-05-07T20:32:42.6358860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6358955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6359031Z E ^ 2025-05-07T20:32:42.6359417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6359423Z 2025-05-07T20:32:42.6359834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6359839Z 2025-05-07T20:32:42.6359936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6360155Z self=, 2025-05-07T20:32:42.6360271Z T=2048, 2025-05-07T20:32:42.6360340Z D=7168, 2025-05-07T20:32:42.6360416Z scale_ub=1200.0, 2025-05-07T20:32:42.6360498Z contiguous=True, 2025-05-07T20:32:42.6360578Z compiled=False, 2025-05-07T20:32:42.6360647Z ) 2025-05-07T20:32:42.6360862Z self = 2025-05-07T20:32:42.6361033Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6361038Z 2025-05-07T20:32:42.6361109Z @given( 2025-05-07T20:32:42.6361222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6361319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6361433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6361546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6361655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6361730Z ) 2025-05-07T20:32:42.6362012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6362102Z def test_silu_mul_quant( 2025-05-07T20:32:42.6362175Z self, 2025-05-07T20:32:42.6362244Z T: int, 2025-05-07T20:32:42.6362313Z D: int, 2025-05-07T20:32:42.6362410Z scale_ub: Optional[float], 2025-05-07T20:32:42.6362498Z contiguous: bool, 2025-05-07T20:32:42.6362579Z compiled: bool, 2025-05-07T20:32:42.6362655Z ) -> None: 2025-05-07T20:32:42.6362745Z torch.manual_seed(2025) 2025-05-07T20:32:42.6362814Z 2025-05-07T20:32:42.6362977Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6364764Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6364785Z 2025-05-07T20:32:42.6364898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6364905Z 2025-05-07T20:32:42.6365003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6365228Z self=, 2025-05-07T20:32:42.6365298Z T=1, 2025-05-07T20:32:42.6365367Z D=5120, 2025-05-07T20:32:42.6365450Z scale_ub=1200.0, 2025-05-07T20:32:42.6365532Z contiguous=True, 2025-05-07T20:32:42.6365611Z compiled=False, 2025-05-07T20:32:42.6365685Z ) 2025-05-07T20:32:42.6365894Z self = 2025-05-07T20:32:42.6366077Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6366086Z 2025-05-07T20:32:42.6366164Z @given( 2025-05-07T20:32:42.6366300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6366399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6366507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6366668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6366779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6366847Z ) 2025-05-07T20:32:42.6367086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6367218Z def test_silu_mul_quant( 2025-05-07T20:32:42.6367293Z self, 2025-05-07T20:32:42.6367367Z T: int, 2025-05-07T20:32:42.6367438Z D: int, 2025-05-07T20:32:42.6367529Z scale_ub: Optional[float], 2025-05-07T20:32:42.6367618Z contiguous: bool, 2025-05-07T20:32:42.6367697Z compiled: bool, 2025-05-07T20:32:42.6367808Z ) -> None: 2025-05-07T20:32:42.6367898Z torch.manual_seed(2025) 2025-05-07T20:32:42.6367966Z 2025-05-07T20:32:42.6368125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6368195Z 2025-05-07T20:32:42.6368284Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6368400Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6368491Z x = x_sign * x_clamp 2025-05-07T20:32:42.6368565Z x0 = x[:, :D] 2025-05-07T20:32:42.6368651Z x1 = x[:, D:] 2025-05-07T20:32:42.6368719Z 2025-05-07T20:32:42.6368796Z if contiguous: 2025-05-07T20:32:42.6368887Z x0 = x0.contiguous() 2025-05-07T20:32:42.6368969Z x1 = x1.contiguous() 2025-05-07T20:32:42.6369037Z 2025-05-07T20:32:42.6369126Z if scale_ub is not None: 2025-05-07T20:32:42.6369226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6369400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6369479Z ) 2025-05-07T20:32:42.6369550Z else: 2025-05-07T20:32:42.6369641Z scale_ub_tensor = None 2025-05-07T20:32:42.6369708Z 2025-05-07T20:32:42.6369835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6369919Z op = silu_mul_quant 2025-05-07T20:32:42.6370006Z if compiled: 2025-05-07T20:32:42.6370101Z op = torch.compile(op) 2025-05-07T20:32:42.6370206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6370272Z 2025-05-07T20:32:42.6370358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6370362Z 2025-05-07T20:32:42.6370459Z moe/activation_test.py:117: 2025-05-07T20:32:42.6370583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6370679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6370774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6371276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6371377Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6371732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6371956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6372296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6372387Z kernel = self.compile( 2025-05-07T20:32:42.6372763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6372936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6373056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6373063Z 2025-05-07T20:32:42.6373271Z self = 2025-05-07T20:32:42.6374049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6374601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7ea3040>} 2025-05-07T20:32:42.6375381Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6375568Z context = 2025-05-07T20:32:42.6375573Z 2025-05-07T20:32:42.6375739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6376061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6376164Z module_map=module_map) 2025-05-07T20:32:42.6376321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6376416Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6376494Z E ^ 2025-05-07T20:32:42.6376846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6376851Z 2025-05-07T20:32:42.6377261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6377265Z 2025-05-07T20:32:42.6377364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6377583Z self=, 2025-05-07T20:32:42.6377660Z T=2048, 2025-05-07T20:32:42.6377733Z D=5120, 2025-05-07T20:32:42.6377850Z scale_ub=None, 2025-05-07T20:32:42.6377940Z contiguous=True, 2025-05-07T20:32:42.6378019Z compiled=False, 2025-05-07T20:32:42.6378087Z ) 2025-05-07T20:32:42.6378301Z self = 2025-05-07T20:32:42.6378470Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6378479Z 2025-05-07T20:32:42.6378555Z @given( 2025-05-07T20:32:42.6378670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6378765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6378882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6378992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6379101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6379172Z ) 2025-05-07T20:32:42.6379412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6379512Z def test_silu_mul_quant( 2025-05-07T20:32:42.6379584Z self, 2025-05-07T20:32:42.6379653Z T: int, 2025-05-07T20:32:42.6379725Z D: int, 2025-05-07T20:32:42.6379820Z scale_ub: Optional[float], 2025-05-07T20:32:42.6379904Z contiguous: bool, 2025-05-07T20:32:42.6379983Z compiled: bool, 2025-05-07T20:32:42.6380061Z ) -> None: 2025-05-07T20:32:42.6380153Z torch.manual_seed(2025) 2025-05-07T20:32:42.6380224Z 2025-05-07T20:32:42.6380386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6380453Z 2025-05-07T20:32:42.6380540Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.6382322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6382330Z 2025-05-07T20:32:42.6382447Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.6382496Z 2025-05-07T20:32:42.6382596Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6382814Z self=, 2025-05-07T20:32:42.6382892Z T=16384, 2025-05-07T20:32:42.6382964Z D=5120, 2025-05-07T20:32:42.6383080Z scale_ub=None, 2025-05-07T20:32:42.6383166Z contiguous=True, 2025-05-07T20:32:42.6383243Z compiled=False, 2025-05-07T20:32:42.6383315Z ) 2025-05-07T20:32:42.6383531Z self = 2025-05-07T20:32:42.6383702Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6383748Z 2025-05-07T20:32:42.6383820Z @given( 2025-05-07T20:32:42.6383933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6384025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6384140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6384252Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6384364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6384435Z ) 2025-05-07T20:32:42.6384676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6384766Z def test_silu_mul_quant( 2025-05-07T20:32:42.6384838Z self, 2025-05-07T20:32:42.6384908Z T: int, 2025-05-07T20:32:42.6384983Z D: int, 2025-05-07T20:32:42.6385076Z scale_ub: Optional[float], 2025-05-07T20:32:42.6385160Z contiguous: bool, 2025-05-07T20:32:42.6385243Z compiled: bool, 2025-05-07T20:32:42.6385317Z ) -> None: 2025-05-07T20:32:42.6385449Z torch.manual_seed(2025) 2025-05-07T20:32:42.6385523Z 2025-05-07T20:32:42.6385687Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6387471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
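To rerun one of these parameterizations on its own, the failing arguments can be pinned with hypothesis.example, which executes the pinned case on every run in addition to the generated ones. A sketch using the same strategies as the test above, written as a free function for brevity and with the body elided:

    from typing import Optional

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the log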
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6387479Z 2025-05-07T20:32:42.6387592Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6387597Z 2025-05-07T20:32:42.6387696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6387922Z self=, 2025-05-07T20:32:42.6387995Z T=4096, 2025-05-07T20:32:42.6388066Z D=5120, 2025-05-07T20:32:42.6388141Z scale_ub=None, 2025-05-07T20:32:42.6388220Z contiguous=True, 2025-05-07T20:32:42.6388303Z compiled=False, 2025-05-07T20:32:42.6388368Z ) 2025-05-07T20:32:42.6388580Z self = 2025-05-07T20:32:42.6388750Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6388755Z 2025-05-07T20:32:42.6388825Z @given( 2025-05-07T20:32:42.6388942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6389033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6389142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6389254Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6389361Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6389432Z ) 2025-05-07T20:32:42.6389673Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6389761Z def test_silu_mul_quant( 2025-05-07T20:32:42.6389910Z self, 2025-05-07T20:32:42.6389985Z T: int, 2025-05-07T20:32:42.6390058Z D: int, 2025-05-07T20:32:42.6390150Z scale_ub: Optional[float], 2025-05-07T20:32:42.6390285Z contiguous: bool, 2025-05-07T20:32:42.6390365Z compiled: bool, 2025-05-07T20:32:42.6390439Z ) -> None: 2025-05-07T20:32:42.6390529Z torch.manual_seed(2025) 2025-05-07T20:32:42.6390598Z 2025-05-07T20:32:42.6390805Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6392584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6392629Z 2025-05-07T20:32:42.6392749Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6392753Z 2025-05-07T20:32:42.6392850Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6393071Z self=, 2025-05-07T20:32:42.6393146Z T=2048, 2025-05-07T20:32:42.6393218Z D=5120, 2025-05-07T20:32:42.6393296Z scale_ub=None, 2025-05-07T20:32:42.6393379Z contiguous=False, 2025-05-07T20:32:42.6393456Z compiled=False, 2025-05-07T20:32:42.6393528Z ) 2025-05-07T20:32:42.6393738Z self = 2025-05-07T20:32:42.6393947Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6393952Z 2025-05-07T20:32:42.6394030Z @given( 2025-05-07T20:32:42.6394141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6394235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6394347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6394461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6394568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6394639Z ) 2025-05-07T20:32:42.6394881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6394975Z def test_silu_mul_quant( 2025-05-07T20:32:42.6395042Z self, 2025-05-07T20:32:42.6395114Z T: int, 2025-05-07T20:32:42.6395194Z D: int, 2025-05-07T20:32:42.6395287Z scale_ub: Optional[float], 2025-05-07T20:32:42.6395371Z contiguous: bool, 2025-05-07T20:32:42.6395459Z compiled: bool, 2025-05-07T20:32:42.6395532Z ) -> None: 2025-05-07T20:32:42.6395621Z torch.manual_seed(2025) 2025-05-07T20:32:42.6395689Z 2025-05-07T20:32:42.6395849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6397624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6397632Z 2025-05-07T20:32:42.6397744Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6397751Z 2025-05-07T20:32:42.6397854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6398072Z self=, 2025-05-07T20:32:42.6398141Z T=4096, 2025-05-07T20:32:42.6398216Z D=7168, 2025-05-07T20:32:42.6398292Z scale_ub=None, 2025-05-07T20:32:42.6398369Z contiguous=True, 2025-05-07T20:32:42.6398450Z compiled=True, 2025-05-07T20:32:42.6398564Z ) 2025-05-07T20:32:42.6398781Z self = 2025-05-07T20:32:42.6398947Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6398951Z 2025-05-07T20:32:42.6399059Z @given( 2025-05-07T20:32:42.6399176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6399270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6399379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6399490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6399642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6399713Z ) 2025-05-07T20:32:42.6399954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6400044Z def test_silu_mul_quant( 2025-05-07T20:32:42.6400118Z self, 2025-05-07T20:32:42.6400190Z T: int, 2025-05-07T20:32:42.6400262Z D: int, 2025-05-07T20:32:42.6400360Z scale_ub: Optional[float], 2025-05-07T20:32:42.6400446Z contiguous: bool, 2025-05-07T20:32:42.6400526Z compiled: bool, 2025-05-07T20:32:42.6400600Z ) -> None: 2025-05-07T20:32:42.6400691Z torch.manual_seed(2025) 2025-05-07T20:32:42.6400756Z 2025-05-07T20:32:42.6400921Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6402735Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6402752Z 2025-05-07T20:32:42.6402866Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6402871Z 2025-05-07T20:32:42.6402965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6403193Z self=, 2025-05-07T20:32:42.6403264Z T=2048, 2025-05-07T20:32:42.6403337Z D=5120, 2025-05-07T20:32:42.6403416Z scale_ub=1200.0, 2025-05-07T20:32:42.6403496Z contiguous=False, 2025-05-07T20:32:42.6403583Z compiled=False, 2025-05-07T20:32:42.6403651Z ) 2025-05-07T20:32:42.6404243Z self = 2025-05-07T20:32:42.6404423Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6404428Z 2025-05-07T20:32:42.6404499Z @given( 2025-05-07T20:32:42.6404611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6404709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6404821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6404930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6405039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6405108Z ) 2025-05-07T20:32:42.6405360Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6405449Z def test_silu_mul_quant( 2025-05-07T20:32:42.6405521Z self, 2025-05-07T20:32:42.6405595Z T: int, 2025-05-07T20:32:42.6405666Z D: int, 2025-05-07T20:32:42.6405759Z scale_ub: Optional[float], 2025-05-07T20:32:42.6405853Z contiguous: bool, 2025-05-07T20:32:42.6405933Z compiled: bool, 2025-05-07T20:32:42.6406005Z ) -> None: 2025-05-07T20:32:42.6406097Z torch.manual_seed(2025) 2025-05-07T20:32:42.6406166Z 2025-05-07T20:32:42.6406338Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6408325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
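Note: the free-memory figure stays pinned around 26 MiB while Hypothesis keeps trying new examples, which suggests earlier examples (or an earlier test) left the caching allocator full. One plausible mitigation, sketched here with standard PyTorch APIs and not taken from the test file itself, is to release cached blocks between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return cached
        # allocator blocks to the driver.
        gc.collect()
        torch.cuda.synchronize()
        torch.cuda.empty_cache()

Calling this at the top of the test body (or in setUp/tearDown) gives each generated example a clean allocator state, at the cost of slower allocations.
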
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6408331Z 2025-05-07T20:32:42.6408502Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6408510Z 2025-05-07T20:32:42.6408606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6408828Z self=, 2025-05-07T20:32:42.6408904Z T=4096, 2025-05-07T20:32:42.6408975Z D=7168, 2025-05-07T20:32:42.6409052Z scale_ub=1200.0, 2025-05-07T20:32:42.6409139Z contiguous=True, 2025-05-07T20:32:42.6409221Z compiled=False, 2025-05-07T20:32:42.6409291Z ) 2025-05-07T20:32:42.6409510Z self = 2025-05-07T20:32:42.6409677Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6409682Z 2025-05-07T20:32:42.6409758Z @given( 2025-05-07T20:32:42.6409869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6409960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6410072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6410241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6410352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6410421Z ) 2025-05-07T20:32:42.6410661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6410749Z def test_silu_mul_quant( 2025-05-07T20:32:42.6410828Z self, 2025-05-07T20:32:42.6410900Z T: int, 2025-05-07T20:32:42.6410969Z D: int, 2025-05-07T20:32:42.6411066Z scale_ub: Optional[float], 2025-05-07T20:32:42.6411148Z contiguous: bool, 2025-05-07T20:32:42.6411235Z compiled: bool, 2025-05-07T20:32:42.6411309Z ) -> None: 2025-05-07T20:32:42.6411399Z torch.manual_seed(2025) 2025-05-07T20:32:42.6411467Z 2025-05-07T20:32:42.6411629Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6413416Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6413430Z 2025-05-07T20:32:42.6413541Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6413545Z 2025-05-07T20:32:42.6413643Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6413867Z self=, 2025-05-07T20:32:42.6413939Z T=16384, 2025-05-07T20:32:42.6414010Z D=7168, 2025-05-07T20:32:42.6414090Z scale_ub=None, 2025-05-07T20:32:42.6414170Z contiguous=False, 2025-05-07T20:32:42.6414251Z compiled=True, 2025-05-07T20:32:42.6414322Z ) 2025-05-07T20:32:42.6414530Z self = 2025-05-07T20:32:42.6414700Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6414705Z 2025-05-07T20:32:42.6414775Z @given( 2025-05-07T20:32:42.6414885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6415027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6415135Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6415247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6415398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6415467Z ) 2025-05-07T20:32:42.6415710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6415800Z def test_silu_mul_quant( 2025-05-07T20:32:42.6415872Z self, 2025-05-07T20:32:42.6415948Z T: int, 2025-05-07T20:32:42.6416058Z D: int, 2025-05-07T20:32:42.6416150Z scale_ub: Optional[float], 2025-05-07T20:32:42.6416239Z contiguous: bool, 2025-05-07T20:32:42.6416318Z compiled: bool, 2025-05-07T20:32:42.6416388Z ) -> None: 2025-05-07T20:32:42.6416483Z torch.manual_seed(2025) 2025-05-07T20:32:42.6416551Z 2025-05-07T20:32:42.6416713Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6418543Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6418551Z 2025-05-07T20:32:42.6418662Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6418671Z 2025-05-07T20:32:42.6418767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6418989Z self=, 2025-05-07T20:32:42.6419067Z T=4096, 2025-05-07T20:32:42.6419143Z D=7168, 2025-05-07T20:32:42.6419223Z scale_ub=None, 2025-05-07T20:32:42.6419307Z contiguous=True, 2025-05-07T20:32:42.6419387Z compiled=False, 2025-05-07T20:32:42.6419452Z ) 2025-05-07T20:32:42.6419675Z self = 2025-05-07T20:32:42.6419841Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6419845Z 2025-05-07T20:32:42.6419921Z @given( 2025-05-07T20:32:42.6420034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6420128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6420247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6420358Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6420466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6420541Z ) 2025-05-07T20:32:42.6420781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6420871Z def test_silu_mul_quant( 2025-05-07T20:32:42.6420945Z self, 2025-05-07T20:32:42.6421015Z T: int, 2025-05-07T20:32:42.6421082Z D: int, 2025-05-07T20:32:42.6421177Z scale_ub: Optional[float], 2025-05-07T20:32:42.6421262Z contiguous: bool, 2025-05-07T20:32:42.6421348Z compiled: bool, 2025-05-07T20:32:42.6421419Z ) -> None: 2025-05-07T20:32:42.6421511Z torch.manual_seed(2025) 2025-05-07T20:32:42.6421581Z 2025-05-07T20:32:42.6421742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6423527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6423579Z 2025-05-07T20:32:42.6423693Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6423734Z 2025-05-07T20:32:42.6423834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6424057Z self=, 2025-05-07T20:32:42.6424130Z T=16384, 2025-05-07T20:32:42.6424204Z D=7168, 2025-05-07T20:32:42.6424283Z scale_ub=None, 2025-05-07T20:32:42.6424400Z contiguous=True, 2025-05-07T20:32:42.6424483Z compiled=False, 2025-05-07T20:32:42.6424553Z ) 2025-05-07T20:32:42.6424761Z self = 2025-05-07T20:32:42.6424936Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.6424941Z 2025-05-07T20:32:42.6425017Z @given( 2025-05-07T20:32:42.6425128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6425223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6425333Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6425447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6425557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6425628Z ) 2025-05-07T20:32:42.6425873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6425959Z def test_silu_mul_quant( 2025-05-07T20:32:42.6426093Z self, 2025-05-07T20:32:42.6426175Z T: int, 2025-05-07T20:32:42.6426265Z D: int, 2025-05-07T20:32:42.6426361Z scale_ub: Optional[float], 2025-05-07T20:32:42.6426445Z contiguous: bool, 2025-05-07T20:32:42.6426526Z compiled: bool, 2025-05-07T20:32:42.6426599Z ) -> None: 2025-05-07T20:32:42.6426692Z torch.manual_seed(2025) 2025-05-07T20:32:42.6426758Z 2025-05-07T20:32:42.6426920Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6428707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6428715Z 2025-05-07T20:32:42.6428829Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6428838Z 2025-05-07T20:32:42.6428935Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6429154Z self=, 2025-05-07T20:32:42.6429233Z T=16384, 2025-05-07T20:32:42.6429302Z D=7168, 2025-05-07T20:32:42.6429378Z scale_ub=1200.0, 2025-05-07T20:32:42.6429459Z contiguous=True, 2025-05-07T20:32:42.6429536Z compiled=False, 2025-05-07T20:32:42.6429604Z ) 2025-05-07T20:32:42.6429872Z self = 2025-05-07T20:32:42.6430042Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6430046Z 2025-05-07T20:32:42.6430122Z @given( 2025-05-07T20:32:42.6430235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6430332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6430445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6430556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6430663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6430734Z ) 2025-05-07T20:32:42.6431023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6431112Z def test_silu_mul_quant( 2025-05-07T20:32:42.6431188Z self, 2025-05-07T20:32:42.6431258Z T: int, 2025-05-07T20:32:42.6431334Z D: int, 2025-05-07T20:32:42.6431470Z scale_ub: Optional[float], 2025-05-07T20:32:42.6431555Z contiguous: bool, 2025-05-07T20:32:42.6431638Z compiled: bool, 2025-05-07T20:32:42.6431708Z ) -> None: 2025-05-07T20:32:42.6431797Z torch.manual_seed(2025) 2025-05-07T20:32:42.6431868Z 2025-05-07T20:32:42.6432031Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6433854Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
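Note: the verbose "Trying example: ..." blocks come from Verbosity.verbose in the @settings decorator, and the session banner later in this log shows the active profile ('ci' with derandomize=True, deadline=None). A profile like that is registered through Hypothesis's standard API; this sketch is reconstructed from the banner, not from the suite's actual conftest:

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
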
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6433865Z 2025-05-07T20:32:42.6433980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6433985Z 2025-05-07T20:32:42.6434083Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6434304Z self=, 2025-05-07T20:32:42.6434375Z T=128, 2025-05-07T20:32:42.6434488Z D=5120, 2025-05-07T20:32:42.6434569Z scale_ub=1200.0, 2025-05-07T20:32:42.6434646Z contiguous=False, 2025-05-07T20:32:42.6434726Z compiled=False, 2025-05-07T20:32:42.6434795Z ) 2025-05-07T20:32:42.6435010Z self = 2025-05-07T20:32:42.6435176Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.6435183Z 2025-05-07T20:32:42.6435256Z @given( 2025-05-07T20:32:42.6435367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6435460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6435571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6435680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6435793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6435863Z ) 2025-05-07T20:32:42.6436106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6436202Z def test_silu_mul_quant( 2025-05-07T20:32:42.6436271Z self, 2025-05-07T20:32:42.6436345Z T: int, 2025-05-07T20:32:42.6436416Z D: int, 2025-05-07T20:32:42.6436507Z scale_ub: Optional[float], 2025-05-07T20:32:42.6436592Z contiguous: bool, 2025-05-07T20:32:42.6436671Z compiled: bool, 2025-05-07T20:32:42.6436743Z ) -> None: 2025-05-07T20:32:42.6436834Z torch.manual_seed(2025) 2025-05-07T20:32:42.6436903Z 2025-05-07T20:32:42.6437065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6437139Z 2025-05-07T20:32:42.6437230Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6437352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6437436Z x = x_sign * x_clamp 2025-05-07T20:32:42.6437512Z x0 = x[:, :D] 2025-05-07T20:32:42.6437589Z x1 = x[:, D:] 2025-05-07T20:32:42.6437658Z 2025-05-07T20:32:42.6437735Z if contiguous: 2025-05-07T20:32:42.6437834Z x0 = x0.contiguous() 2025-05-07T20:32:42.6437921Z x1 = x1.contiguous() 2025-05-07T20:32:42.6437987Z 2025-05-07T20:32:42.6438081Z if scale_ub is not None: 2025-05-07T20:32:42.6438182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6438313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6438458Z ) 2025-05-07T20:32:42.6438530Z else: 2025-05-07T20:32:42.6438623Z scale_ub_tensor = None 2025-05-07T20:32:42.6438691Z 2025-05-07T20:32:42.6438815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6438944Z op = silu_mul_quant 2025-05-07T20:32:42.6439024Z if compiled: 2025-05-07T20:32:42.6439120Z op = torch.compile(op) 2025-05-07T20:32:42.6439226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6439295Z 2025-05-07T20:32:42.6439381Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6439426Z 2025-05-07T20:32:42.6439522Z moe/activation_test.py:117: 2025-05-07T20:32:42.6439645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6439741Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6439838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6440335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6440432Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6440786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6441008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6441346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6441434Z kernel = self.compile( 2025-05-07T20:32:42.6441854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6442023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6442144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6442148Z 2025-05-07T20:32:42.6442355Z self = 2025-05-07T20:32:42.6443138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6443648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7bc5ca0>} 2025-05-07T20:32:42.6444394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6444588Z context = 2025-05-07T20:32:42.6444593Z 2025-05-07T20:32:42.6444755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6445018Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6445122Z module_map=module_map) 2025-05-07T20:32:42.6445279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6445375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6445453Z E ^ 2025-05-07T20:32:42.6445818Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6445824Z 2025-05-07T20:32:42.6446272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6446280Z 2025-05-07T20:32:42.6446376Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6446592Z self=, 2025-05-07T20:32:42.6446671Z T=2048, 2025-05-07T20:32:42.6446738Z D=7168, 2025-05-07T20:32:42.6446858Z scale_ub=None, 2025-05-07T20:32:42.6446945Z contiguous=False, 2025-05-07T20:32:42.6447026Z compiled=False, 2025-05-07T20:32:42.6447092Z ) 2025-05-07T20:32:42.6447306Z self = 2025-05-07T20:32:42.6447510Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.6447515Z 2025-05-07T20:32:42.6447588Z @given( 2025-05-07T20:32:42.6447705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6447797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6447913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6448068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6448178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6448250Z ) 2025-05-07T20:32:42.6448492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6448580Z def test_silu_mul_quant( 2025-05-07T20:32:42.6448660Z self, 2025-05-07T20:32:42.6448732Z T: int, 2025-05-07T20:32:42.6448805Z D: int, 2025-05-07T20:32:42.6448899Z scale_ub: Optional[float], 2025-05-07T20:32:42.6448984Z contiguous: bool, 2025-05-07T20:32:42.6449067Z compiled: bool, 2025-05-07T20:32:42.6449138Z ) -> None: 2025-05-07T20:32:42.6449229Z torch.manual_seed(2025) 2025-05-07T20:32:42.6449302Z 2025-05-07T20:32:42.6449465Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6451286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
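Note: the CompilationError above ("type fp8e4nv not supported in this architecture") is Triton rejecting FP8 E4M3 codegen on this GPU. The job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability (8, 6), while Triton's fp8e4nv lowering requires (8, 9) or newer (Ada/Hopper); only fp8e4b15 and fp8e5 are available here, exactly as the error says. A guard along these lines (a sketch; the test file may gate this differently) would skip rather than fail:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) codegen needs compute capability
        # >= 8.9; the A10G on this runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage: @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 needs SM 8.9+")
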
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6451300Z 2025-05-07T20:32:42.6451413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6451418Z 2025-05-07T20:32:42.6451521Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6451740Z self=, 2025-05-07T20:32:42.6451811Z T=128, 2025-05-07T20:32:42.6451882Z D=7168, 2025-05-07T20:32:42.6451960Z scale_ub=1200.0, 2025-05-07T20:32:42.6452039Z contiguous=True, 2025-05-07T20:32:42.6452124Z compiled=True, 2025-05-07T20:32:42.6452196Z ) 2025-05-07T20:32:42.6452406Z self = 2025-05-07T20:32:42.6452569Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6452574Z 2025-05-07T20:32:42.6452645Z @given( 2025-05-07T20:32:42.6452756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6452856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6452964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6453078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6453188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6453258Z ) 2025-05-07T20:32:42.6453502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6453590Z def test_silu_mul_quant( 2025-05-07T20:32:42.6453662Z self, 2025-05-07T20:32:42.6453738Z T: int, 2025-05-07T20:32:42.6453809Z D: int, 2025-05-07T20:32:42.6453902Z scale_ub: Optional[float], 2025-05-07T20:32:42.6453990Z contiguous: bool, 2025-05-07T20:32:42.6454068Z compiled: bool, 2025-05-07T20:32:42.6454141Z ) -> None: 2025-05-07T20:32:42.6454231Z torch.manual_seed(2025) 2025-05-07T20:32:42.6454297Z 2025-05-07T20:32:42.6454505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6454578Z 2025-05-07T20:32:42.6454665Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6454788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6454923Z x = x_sign * x_clamp 2025-05-07T20:32:42.6454998Z x0 = x[:, :D] 2025-05-07T20:32:42.6455078Z x1 = x[:, D:] 2025-05-07T20:32:42.6455146Z 2025-05-07T20:32:42.6455225Z if contiguous: 2025-05-07T20:32:42.6455317Z x0 = x0.contiguous() 2025-05-07T20:32:42.6455400Z x1 = x1.contiguous() 2025-05-07T20:32:42.6455512Z 2025-05-07T20:32:42.6455602Z if scale_ub is not None: 2025-05-07T20:32:42.6455704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6455834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6455909Z ) 2025-05-07T20:32:42.6455981Z else: 2025-05-07T20:32:42.6456070Z scale_ub_tensor = None 2025-05-07T20:32:42.6456142Z 2025-05-07T20:32:42.6456266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6456352Z op = silu_mul_quant 2025-05-07T20:32:42.6456431Z if compiled: 2025-05-07T20:32:42.6456528Z op = torch.compile(op) 2025-05-07T20:32:42.6456631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6456699Z 2025-05-07T20:32:42.6456785Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6456789Z 2025-05-07T20:32:42.6456885Z moe/activation_test.py:117: 2025-05-07T20:32:42.6457048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6457153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6457246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6457610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6457700Z return fn(*args, **kwargs) 2025-05-07T20:32:42.6458197Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6458288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6458649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6458871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6459206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6459299Z kernel = self.compile( 2025-05-07T20:32:42.6459675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6459848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6459968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6459975Z 2025-05-07T20:32:42.6460176Z self = 2025-05-07T20:32:42.6460960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6461466Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9fc7b390d0>} 2025-05-07T20:32:42.6462215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6462404Z context = 2025-05-07T20:32:42.6462409Z 2025-05-07T20:32:42.6462575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6462879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6462980Z module_map=module_map) 2025-05-07T20:32:42.6463179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6463273Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6463343Z E ^ 2025-05-07T20:32:42.6463697Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6463746Z 2025-05-07T20:32:42.6464160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6464165Z 2025-05-07T20:32:42.6464265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6464481Z self=, 2025-05-07T20:32:42.6464556Z T=128, 2025-05-07T20:32:42.6464633Z D=7168, 2025-05-07T20:32:42.6464711Z scale_ub=1200.0, 2025-05-07T20:32:42.6464789Z contiguous=True, 2025-05-07T20:32:42.6464871Z compiled=False, 2025-05-07T20:32:42.6464936Z ) 2025-05-07T20:32:42.6465153Z self = 2025-05-07T20:32:42.6465316Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.6465321Z 2025-05-07T20:32:42.6465391Z @given( 2025-05-07T20:32:42.6465507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6465665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6465780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6465896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6466004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6466078Z ) 2025-05-07T20:32:42.6466320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6466424Z def test_silu_mul_quant( 2025-05-07T20:32:42.6466511Z self, 2025-05-07T20:32:42.6466589Z T: int, 2025-05-07T20:32:42.6466672Z D: int, 2025-05-07T20:32:42.6466768Z scale_ub: Optional[float], 2025-05-07T20:32:42.6466856Z contiguous: bool, 2025-05-07T20:32:42.6466938Z compiled: bool, 2025-05-07T20:32:42.6472339Z ) -> None: 2025-05-07T20:32:42.6472449Z torch.manual_seed(2025) 2025-05-07T20:32:42.6472518Z 2025-05-07T20:32:42.6472692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6472770Z 2025-05-07T20:32:42.6472863Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6472989Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6474797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
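Note: the test body visible in these traces builds a conditioned input and compares the fused kernel against an fp32 reference (ref_fn, shown further down in this log). Restated compactly -- the rationale comments are an interpretation, not taken from the source:

    import torch

    def conditioned_input(T: int, D: int) -> torch.Tensor:
        # Mirrors the test's preprocessing: keep the sign, bound |x| to
        # [0.01, 2.0] -- presumably so per-row quantization scales stay
        # away from degenerate values.
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        return torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # fp32 reference matching ref_fn in the log: SiLU(x0) * x1.
        x0, x1 = x0.float(), x1.float()
        return x0 * torch.sigmoid(x0) * x1
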
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6474805Z 2025-05-07T20:32:42.6474926Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6474931Z 2025-05-07T20:32:42.6475031Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6475261Z self=, 2025-05-07T20:32:42.6475343Z T=128, 2025-05-07T20:32:42.6475416Z D=5120, 2025-05-07T20:32:42.6475497Z scale_ub=1200.0, 2025-05-07T20:32:42.6475575Z contiguous=True, 2025-05-07T20:32:42.6475651Z compiled=True, 2025-05-07T20:32:42.6475723Z ) 2025-05-07T20:32:42.6475940Z self = 2025-05-07T20:32:42.6476175Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6476180Z 2025-05-07T20:32:42.6476252Z @given( 2025-05-07T20:32:42.6476367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6476500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6476618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6476731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6476844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6476949Z ) 2025-05-07T20:32:42.6477194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6477287Z def test_silu_mul_quant( 2025-05-07T20:32:42.6477359Z self, 2025-05-07T20:32:42.6477429Z T: int, 2025-05-07T20:32:42.6477505Z D: int, 2025-05-07T20:32:42.6477597Z scale_ub: Optional[float], 2025-05-07T20:32:42.6477684Z contiguous: bool, 2025-05-07T20:32:42.6477765Z compiled: bool, 2025-05-07T20:32:42.6477842Z ) -> None: 2025-05-07T20:32:42.6477931Z torch.manual_seed(2025) 2025-05-07T20:32:42.6478005Z 2025-05-07T20:32:42.6478172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6478241Z 2025-05-07T20:32:42.6478329Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6478447Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6480257Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
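Note: both the kernel under test and the reference path quantize row-wise to FP8, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A torch-only emulation consistent with that contract -- an illustrative approximation, not FBGEMM's triton_quantize_fp8_row, and the E4M3_MAX constant is an assumption of the standard e4m3fn range:

    import torch

    E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # Per-row scale so each row fits the e4m3 range; scale_ub, when
        # given, caps the row maximum before the scale is derived.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / E4M3_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale
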
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6480268Z 2025-05-07T20:32:42.6480382Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:42.6480387Z 2025-05-07T20:32:42.6480486Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6480711Z self=, 2025-05-07T20:32:42.6480784Z T=128, 2025-05-07T20:32:42.6480860Z D=7168, 2025-05-07T20:32:42.6480940Z scale_ub=None, 2025-05-07T20:32:42.6481017Z contiguous=True, 2025-05-07T20:32:42.6481096Z compiled=True, 2025-05-07T20:32:42.6481167Z ) 2025-05-07T20:32:42.6481390Z self = 2025-05-07T20:32:42.6481551Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.6481556Z 2025-05-07T20:32:42.6481629Z @given( 2025-05-07T20:32:42.6481744Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6481838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6481947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6482060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6482169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6482236Z ) 2025-05-07T20:32:42.6482485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6482572Z def test_silu_mul_quant( 2025-05-07T20:32:42.6482646Z self, 2025-05-07T20:32:42.6482717Z T: int, 2025-05-07T20:32:42.6482786Z D: int, 2025-05-07T20:32:42.6482890Z scale_ub: Optional[float], 2025-05-07T20:32:42.6482973Z contiguous: bool, 2025-05-07T20:32:42.6483053Z compiled: bool, 2025-05-07T20:32:42.6483126Z ) -> None: 2025-05-07T20:32:42.6483215Z torch.manual_seed(2025) 2025-05-07T20:32:42.6483282Z 2025-05-07T20:32:42.6483445Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6485290Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.6485339Z 2025-05-07T20:32:42.6485455Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.6485585Z =============================== warnings summary =============================== 2025-05-07T20:32:42.6485890Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6486213Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6486528Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:42.6487426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:42.6487650Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:42.6487694Z 2025-05-07T20:32:42.6487904Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:42.6488067Z ================= 1 failed, 1 deselected, 3 warnings in 19.29s ================= 2025-05-07T20:32:44.1746478Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:44.2362988Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:44.2363252Z 2025-05-07T20:32:46.2379924Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.3950088Z ============================= test session starts ============================== 2025-05-07T20:32:48.3950759Z platform linux -- Python 3.9.18, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.3951289Z cachedir: .pytest_cache 2025-05-07T20:32:48.3951861Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.3952567Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.3952982Z plugins: hypothesis-6.131.14 2025-05-07T20:32:49.9940148Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.2068882Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.2069279Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.2069512Z 2025-05-07T20:32:52.8653435Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.8654240Z self=, 2025-05-07T20:32:52.8654657Z T=1, 2025-05-07T20:32:52.8654869Z D=5120, 2025-05-07T20:32:52.8655075Z scale_ub=None, 2025-05-07T20:32:52.8655293Z contiguous=True, 2025-05-07T20:32:52.8655517Z compiled=True, 2025-05-07T20:32:52.8655734Z ) 2025-05-07T20:32:52.8656059Z self = 2025-05-07T20:32:52.8656545Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.8657131Z 2025-05-07T20:32:52.8657212Z @given( 2025-05-07T20:32:52.8657447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.8657763Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.8658171Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.8658535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.8658893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.8659180Z ) 2025-05-07T20:32:52.8659537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.8660070Z def test_silu_mul_quant( 2025-05-07T20:32:52.8660311Z self, 2025-05-07T20:32:52.8660511Z T: int, 2025-05-07T20:32:52.8660712Z D: int, 2025-05-07T20:32:52.8660926Z scale_ub: Optional[float], 2025-05-07T20:32:52.8661199Z contiguous: bool, 2025-05-07T20:32:52.8661440Z compiled: bool, 2025-05-07T20:32:52.8661669Z ) -> None: 2025-05-07T20:32:52.8661884Z torch.manual_seed(2025) 2025-05-07T20:32:52.8662133Z 2025-05-07T20:32:52.8662407Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.8662747Z 2025-05-07T20:32:52.8662942Z x_sign = torch.sign(x) 2025-05-07T20:32:52.8663239Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:52.8663543Z x = x_sign * x_clamp 2025-05-07T20:32:52.8663794Z x0 = x[:, :D] 2025-05-07T20:32:52.8664014Z x1 = x[:, D:] 2025-05-07T20:32:52.8664217Z 2025-05-07T20:32:52.8664407Z if contiguous: 2025-05-07T20:32:52.8664738Z x0 = x0.contiguous() 2025-05-07T20:32:52.8664996Z x1 = x1.contiguous() 2025-05-07T20:32:52.8665239Z 2025-05-07T20:32:52.8665432Z if scale_ub is not None: 2025-05-07T20:32:52.8665703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.8666046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.8666362Z ) 2025-05-07T20:32:52.8666560Z else: 2025-05-07T20:32:52.8666768Z scale_ub_tensor = None 2025-05-07T20:32:52.8667022Z 2025-05-07T20:32:52.8667255Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.8667566Z op = silu_mul_quant 2025-05-07T20:32:52.8667824Z if compiled: 2025-05-07T20:32:52.8668079Z op = torch.compile(op) 2025-05-07T20:32:52.8668374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.8668655Z 2025-05-07T20:32:52.8668850Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.8669141Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.8669437Z 2025-05-07T20:32:52.8669677Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.8670087Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.8670385Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.8670704Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.8671070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.8671375Z 2025-05-07T20:32:52.8671581Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.8671776Z 2025-05-07T20:32:52.8671885Z moe/activation_test.py:126: 2025-05-07T20:32:52.8672182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.8672523Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.8672855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.8673644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.8674420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.8674968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.8675652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.8676390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.8677151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.8677910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:52.8678707Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.8679432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.8680113Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.8680716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.8681234Z fn() 2025-05-07T20:32:52.8681736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.8682324Z self.fn.run( 
2025-05-07T20:32:52.8682792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.8683322Z kernel = self.compile( 2025-05-07T20:32:52.8683862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.8684517Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.8684959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.8685209Z 2025-05-07T20:32:52.8685424Z self = 2025-05-07T20:32:52.8686514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.8687921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891b3dd9d0>} 2025-05-07T20:32:52.8689317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.8690339Z context = 2025-05-07T20:32:52.8690630Z 2025-05-07T20:32:52.8690808Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.8691326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.8691800Z module_map=module_map) 2025-05-07T20:32:52.8692171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.8692527Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.8692797Z E ^ 2025-05-07T20:32:52.8693272Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.8693724Z 2025-05-07T20:32:52.8694147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.8694657Z 2025-05-07T20:32:52.8694762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.8695182Z self=, 2025-05-07T20:32:52.8695594Z T=2048, 2025-05-07T20:32:52.8695785Z D=5120, 2025-05-07T20:32:52.8695984Z scale_ub=1200.0, 2025-05-07T20:32:52.8696210Z contiguous=True, 2025-05-07T20:32:52.8696428Z compiled=False, 2025-05-07T20:32:52.8696634Z ) 2025-05-07T20:32:54.3518174Z self = 2025-05-07T20:32:54.3519253Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3519616Z 2025-05-07T20:32:54.3519705Z @given( 2025-05-07T20:32:54.3519944Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3520350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3520663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3521000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3521324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3521611Z ) 2025-05-07T20:32:54.3522058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3522499Z def test_silu_mul_quant( 2025-05-07T20:32:54.3522748Z self, 2025-05-07T20:32:54.3522945Z T: int, 2025-05-07T20:32:54.3523137Z D: int, 2025-05-07T20:32:54.3523357Z scale_ub: Optional[float], 2025-05-07T20:32:54.3523633Z contiguous: bool, 2025-05-07T20:32:54.3523872Z compiled: bool, 2025-05-07T20:32:54.3524102Z ) -> None: 2025-05-07T20:32:54.3524322Z torch.manual_seed(2025) 2025-05-07T20:32:54.3524563Z 2025-05-07T20:32:54.3524843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3525193Z 
2025-05-07T20:32:54.3525388Z x_sign = torch.sign(x) 2025-05-07T20:32:54.3525675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.3525995Z x = x_sign * x_clamp 2025-05-07T20:32:54.3526237Z x0 = x[:, :D] 2025-05-07T20:32:54.3526447Z x1 = x[:, D:] 2025-05-07T20:32:54.3527088Z 2025-05-07T20:32:54.3527281Z if contiguous: 2025-05-07T20:32:54.3527517Z x0 = x0.contiguous() 2025-05-07T20:32:54.3527778Z x1 = x1.contiguous() 2025-05-07T20:32:54.3528025Z 2025-05-07T20:32:54.3528213Z if scale_ub is not None: 2025-05-07T20:32:54.3528491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.3528835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.3529139Z ) 2025-05-07T20:32:54.3529335Z else: 2025-05-07T20:32:54.3529545Z scale_ub_tensor = None 2025-05-07T20:32:54.3529794Z 2025-05-07T20:32:54.3530036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.3530355Z op = silu_mul_quant 2025-05-07T20:32:54.3530605Z if compiled: 2025-05-07T20:32:54.3530853Z op = torch.compile(op) 2025-05-07T20:32:54.3531154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3531437Z 2025-05-07T20:32:54.3531624Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.3531800Z 2025-05-07T20:32:54.3531900Z moe/activation_test.py:117: 2025-05-07T20:32:54.3532196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3532528Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.3532814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.3533518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.3534216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.3534754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.3535446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.3536110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.3536644Z kernel = self.compile( 2025-05-07T20:32:54.3537192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.3537856Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.3538256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.3538535Z 2025-05-07T20:32:54.3538744Z self = 2025-05-07T20:32:54.3539881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.3541298Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f88f9ced5e0>}
2025-05-07T20:32:54.3542697Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:54.3543724Z context = <...>
2025-05-07T20:32:54.3544012Z 
2025-05-07T20:32:54.3544186Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3544715Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3545194Z                            module_map=module_map)
2025-05-07T20:32:54.3545562Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3545927Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.3546189Z E       ^
2025-05-07T20:32:54.3546661Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3547161Z 
2025-05-07T20:32:54.3547581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.3548101Z 
2025-05-07T20:32:54.3548205Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3548663Z     self=<...>,
2025-05-07T20:32:54.3549079Z     T=2048,
2025-05-07T20:32:54.3549276Z     D=5120,
2025-05-07T20:32:54.3549472Z     scale_ub=1200.0,
2025-05-07T20:32:54.3549694Z     contiguous=True,
2025-05-07T20:32:54.3549994Z     compiled=True,
2025-05-07T20:32:54.3550205Z )
2025-05-07T20:32:54.3550530Z self = <...>
2025-05-07T20:32:54.3551024Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:54.3551299Z 
2025-05-07T20:32:54.3551375Z     @given(
2025-05-07T20:32:54.3551605Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.3551919Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.3552232Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.3552565Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.3552891Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.3553180Z     )
2025-05-07T20:32:54.3553535Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.3553983Z     def test_silu_mul_quant(
2025-05-07T20:32:54.3554218Z         self,
2025-05-07T20:32:54.3554416Z         T: int,
2025-05-07T20:32:54.3554617Z         D: int,
2025-05-07T20:32:54.3554830Z         scale_ub: Optional[float],
2025-05-07T20:32:54.3555113Z         contiguous: bool,
2025-05-07T20:32:54.3555355Z         compiled: bool,
2025-05-07T20:32:54.3555575Z     ) -> None:
2025-05-07T20:32:54.3555798Z         torch.manual_seed(2025)
2025-05-07T20:32:54.3556044Z 
2025-05-07T20:32:54.3556309Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.3556667Z 
2025-05-07T20:32:54.3556871Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.3557158Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.3557470Z         x = x_sign * x_clamp
2025-05-07T20:32:54.3557716Z         x0 = x[:, :D]
2025-05-07T20:32:54.3557929Z         x1 = x[:, D:]
2025-05-07T20:32:54.3558145Z 
2025-05-07T20:32:54.3558386Z         if contiguous:
2025-05-07T20:32:54.3558638Z             x0 = x0.contiguous()
2025-05-07T20:32:54.3558930Z             x1 = x1.contiguous()
2025-05-07T20:32:54.3559173Z 
2025-05-07T20:32:54.3559368Z         if scale_ub is not None:
2025-05-07T20:32:54.3559684Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.3560028Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.3560349Z             )
2025-05-07T20:32:54.3560538Z         else:
2025-05-07T20:32:54.3560751Z             scale_ub_tensor = None
2025-05-07T20:32:54.3561005Z 
2025-05-07T20:32:54.3561273Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.3561593Z             op = silu_mul_quant
2025-05-07T20:32:54.3561856Z             if compiled:
2025-05-07T20:32:54.3562106Z                 op = torch.compile(op)
2025-05-07T20:32:54.3562408Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.3562694Z 
2025-05-07T20:32:54.3562887Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.3563177Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.3563474Z 
2025-05-07T20:32:54.3563704Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.3564048Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.3564344Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.3564665Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.3565022Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.3565331Z 
2025-05-07T20:32:54.3565580Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.3565780Z 
2025-05-07T20:32:54.3565879Z moe/activation_test.py:126: 
2025-05-07T20:32:54.3566177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3566513Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.3566854Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.3567640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.3568397Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.3568949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.3569628Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.3570318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.3571051Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.3571807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:54.3572548Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.3573281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.3573924Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.3574529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.3575040Z     fn()
2025-05-07T20:32:54.3575546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.3576136Z     self.fn.run(
2025-05-07T20:32:54.3576608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.3577141Z     kernel = self.compile(
2025-05-07T20:32:54.3577686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.3578344Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.3578839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3579080Z 
2025-05-07T20:32:54.3579292Z self = <...>
2025-05-07T20:32:54.3580435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.3581844Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f8919e54160>}
2025-05-07T20:32:54.3583260Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:54.3584294Z context = <...>
2025-05-07T20:32:54.3584604Z 
2025-05-07T20:32:54.3584771Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3585310Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3585778Z                            module_map=module_map)
2025-05-07T20:32:54.3586157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3586525Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.3586799Z E       ^
2025-05-07T20:32:54.3587300Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3587770Z 
2025-05-07T20:32:54.3588191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.3588711Z 
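Every Hypothesis example fails the same way: Triton rejects the fp8e4nv dtype (its name for float8_e4m3fn) while lowering both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row. The job runs on linux.g5.4xlarge.nvidia.gpu, whose NVIDIA A10G reports compute capability 8.6, and this Triton build only lowers fp8e4nv on SM 8.9 or newer; on SM 8.6 it offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate the test could use, assuming only stock pytest and torch (the requires_fp8 marker name is illustrative, not an existing FBGEMM helper):

    import pytest
    import torch

    # Skip fp8e4nv tests on GPUs older than SM 8.9; the A10G on
    # g5.4xlarge reports (8, 6) and only supports fp8e4b15 / fp8e5
    # in this Triton build.
    requires_fp8 = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv (float8_e4m3fn) needs SM 8.9 or newer",
    )

Applied as @requires_fp8 on test_silu_mul_quant, each example would report a skip instead of a CompilationError.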
2025-05-07T20:32:54.3588824Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.3589298Z     self=<...>,
2025-05-07T20:32:54.3589701Z     T=16384,
2025-05-07T20:32:54.3589941Z     D=7168,
2025-05-07T20:32:54.3590130Z     scale_ub=1200.0,
2025-05-07T20:32:54.3590352Z     contiguous=False,
2025-05-07T20:32:54.3590589Z     compiled=False,
2025-05-07T20:32:54.3590801Z )
2025-05-07T20:32:55.6908667Z moe/activation_test.py:117: 
2025-05-07T20:32:55.6922630Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.6922978Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:55.6923242Z E       ^
2025-05-07T20:32:55.6923721Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.6924663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.6925366Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.6925791Z     self=<...>,
2025-05-07T20:32:55.6926193Z     T=1,
2025-05-07T20:32:55.6926375Z     D=7168,
2025-05-07T20:32:55.6926566Z     scale_ub=None,
2025-05-07T20:32:55.6926786Z     contiguous=True,
2025-05-07T20:32:55.6927046Z     compiled=True,
2025-05-07T20:32:55.6927249Z )
2025-05-07T20:32:55.6942742Z moe/activation_test.py:126: 
2025-05-07T20:32:55.6962930Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:55.6963291Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:55.6963557Z E       ^
2025-05-07T20:32:55.6964065Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:55.6964974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:55.6965596Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.6966009Z     self=<...>,
2025-05-07T20:32:55.6966410Z     T=4096,
2025-05-07T20:32:55.6966600Z     D=5120,
2025-05-07T20:32:55.6966838Z     scale_ub=None,
2025-05-07T20:32:55.6967048Z     contiguous=False,
2025-05-07T20:32:55.6967279Z     compiled=False,
2025-05-07T20:32:55.6967483Z )
2025-05-07T20:32:57.4475293Z moe/activation_test.py:117: 
2025-05-07T20:32:57.4488948Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.4489328Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.4489609Z E       ^
2025-05-07T20:32:57.4490084Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.4490960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.4491583Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.4491997Z     self=<...>,
2025-05-07T20:32:57.4492398Z     T=4096,
2025-05-07T20:32:57.4492581Z     D=7168,
2025-05-07T20:32:57.4492778Z     scale_ub=None,
2025-05-07T20:32:57.4493005Z     contiguous=False,
2025-05-07T20:32:57.4493223Z     compiled=False,
2025-05-07T20:32:57.4493432Z )
2025-05-07T20:32:57.4506449Z moe/activation_test.py:117: 
2025-05-07T20:32:57.4520061Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.4520412Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.4520731Z E       ^
2025-05-07T20:32:57.4521201Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.4522070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.4522696Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.4523102Z     self=<...>,
2025-05-07T20:32:57.4523504Z     T=128,
2025-05-07T20:32:57.4523693Z     D=7168,
2025-05-07T20:32:57.4523880Z     scale_ub=None,
2025-05-07T20:32:57.4524098Z     contiguous=False,
2025-05-07T20:32:57.4524324Z     compiled=True,
2025-05-07T20:32:57.4524521Z )
2025-05-07T20:32:57.5305104Z moe/activation_test.py:126: 
2025-05-07T20:32:57.5325514Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.5325861Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:57.5326123Z E       ^
2025-05-07T20:32:57.5326591Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5327483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
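The failure is independent of FBGEMM's kernels: any Triton kernel that converts a value to tl.float8e4nv trips the same ValueError during ast_to_ttir on this GPU. A self-contained repro sketch, assuming a CUDA device and a Triton build matching this log (the kernel and variable names are hypothetical):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below is what the compiler rejects on pre-SM-8.9 parts.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises CompilationError on SM 8.6 (A10G); compiles on SM 8.9+.
    _cast_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)

On an SM 8.9+ device (L4, H100, and similar) the same kernel compiles and runs, which is why these tests pass on other runners.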
2025-05-07T20:32:57.5328098Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.5328511Z     self=<...>,
2025-05-07T20:32:57.5328918Z     T=128,
2025-05-07T20:32:57.5329098Z     D=7168,
2025-05-07T20:32:57.5329309Z     scale_ub=None,
2025-05-07T20:32:57.5329545Z     contiguous=False,
2025-05-07T20:32:57.5329767Z     compiled=False,
2025-05-07T20:32:57.5329972Z )
2025-05-07T20:32:57.9364380Z moe/activation_test.py:117: 
2025-05-07T20:32:57.9377886Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.9378242Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.9378506Z E       ^
2025-05-07T20:32:57.9378978Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.9379852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.9380464Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.9380885Z     self=<...>,
2025-05-07T20:32:57.9381300Z     T=4096,
2025-05-07T20:32:57.9381486Z     D=5120,
2025-05-07T20:32:57.9381679Z     scale_ub=1200.0,
2025-05-07T20:32:57.9381907Z     contiguous=True,
2025-05-07T20:32:57.9382130Z     compiled=False,
2025-05-07T20:32:57.9382336Z )
2025-05-07T20:32:57.9395335Z moe/activation_test.py:117: 
2025-05-07T20:32:57.9409121Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.9409473Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.9409731Z E       ^
2025-05-07T20:32:57.9410193Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.9411141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.9411760Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.9412178Z     self=<...>,
2025-05-07T20:32:57.9412582Z     T=1,
2025-05-07T20:32:57.9412770Z     D=5120,
2025-05-07T20:32:57.9412960Z     scale_ub=None,
2025-05-07T20:32:57.9413171Z     contiguous=True,
2025-05-07T20:32:57.9413398Z     compiled=True,
2025-05-07T20:32:57.9413601Z )
2025-05-07T20:32:58.6038650Z moe/activation_test.py:126: 
2025-05-07T20:32:58.6058589Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.6058937Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:58.6059203Z E       ^
2025-05-07T20:32:58.6059759Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.6060633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
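Plain PyTorch can cast to torch.float8_e4m3fn on this GPU (the eager cast is not gated on SM 8.9), so the rowwise quantization that ref_fn delegates to triton_quantize_fp8_row can be approximated without Triton. A sketch under the dequant convention the test itself uses (y is recovered as y_fp8.float() * y_scale[:, None]); the eps and saturation details here are assumptions, not FBGEMM's exact semantics:

    from typing import Optional

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ):
        # Per-row max |y|, optionally clamped from above by scale_ub.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequant scale: y ~= y_fp8.float() * y_scale[:, None].
        y_scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (
            (y.float() / y_scale[:, None])
            .clamp(-fp8_max, fp8_max)
            .to(torch.float8_e4m3fn)
        )
        return y_fp8, y_scale

Such a fallback would let the reference path run on the A10G even though the fused Triton kernels cannot compile there.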
2025-05-07T20:32:59.2263832Z op = torch.compile(op) 2025-05-07T20:32:59.2264140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.2264410Z 2025-05-07T20:32:59.2264601Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.2264888Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.2265174Z 2025-05-07T20:32:59.2265413Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.2265758Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.2266045Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.2266360Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.2266725Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.2267102Z 2025-05-07T20:32:59.2267308Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:59.2267502Z 2025-05-07T20:32:59.2267606Z moe/activation_test.py:126: 2025-05-07T20:32:59.2267903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.2268232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.2268570Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.2269363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.2270199Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.2270750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.2271431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.2272126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.2272842Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.2273595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:59.2274339Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.2275062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.2275696Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.2276303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.2276816Z fn() 2025-05-07T20:32:59.2277317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.2277904Z self.fn.run( 2025-05-07T20:32:59.2278368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.2278896Z kernel = self.compile( 2025-05-07T20:32:59.2279437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.2280144Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.2280544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.2280768Z 2025-05-07T20:32:59.2281007Z self = 2025-05-07T20:32:59.2282099Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:59.2283541Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918da8f70>} 2025-05-07T20:32:59.2284885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.2285907Z context = 2025-05-07T20:32:59.2286192Z 2025-05-07T20:32:59.2286360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.2286887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.2287355Z module_map=module_map) 2025-05-07T20:32:59.2287727Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.2288075Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.2288384Z E ^ 2025-05-07T20:32:59.2288856Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.2289307Z 2025-05-07T20:32:59.2289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.2290240Z 2025-05-07T20:32:59.2290345Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.2290761Z self=, 2025-05-07T20:32:59.2291167Z T=128, 2025-05-07T20:32:59.2291347Z D=5120, 2025-05-07T20:32:59.2291542Z scale_ub=None, 2025-05-07T20:32:59.2291763Z contiguous=True, 2025-05-07T20:32:59.2291983Z compiled=True, 2025-05-07T20:32:59.2292188Z ) 2025-05-07T20:33:00.2026095Z self = 2025-05-07T20:33:00.2026799Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.2027198Z 2025-05-07T20:33:00.2027325Z @given( 2025-05-07T20:33:00.2027630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2028050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2028460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2028789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2029122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2029411Z ) 2025-05-07T20:33:00.2029782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2030278Z def test_silu_mul_quant( 2025-05-07T20:33:00.2030520Z self, 2025-05-07T20:33:00.2030716Z T: int, 2025-05-07T20:33:00.2030913Z D: int, 2025-05-07T20:33:00.2031124Z scale_ub: Optional[float], 2025-05-07T20:33:00.2031403Z contiguous: bool, 2025-05-07T20:33:00.2031641Z compiled: bool, 2025-05-07T20:33:00.2031862Z ) -> None: 2025-05-07T20:33:00.2032087Z torch.manual_seed(2025) 2025-05-07T20:33:00.2032330Z 2025-05-07T20:33:00.2032599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2032945Z 2025-05-07T20:33:00.2033142Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2033427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2033872Z x = x_sign * x_clamp 2025-05-07T20:33:00.2034114Z x0 = x[:, :D] 2025-05-07T20:33:00.2034323Z x1 = x[:, D:] 2025-05-07T20:33:00.2034531Z 2025-05-07T20:33:00.2034716Z if contiguous: 2025-05-07T20:33:00.2034940Z x0 = x0.contiguous() 2025-05-07T20:33:00.2035266Z x1 = x1.contiguous() 2025-05-07T20:33:00.2035511Z 2025-05-07T20:33:00.2035704Z if scale_ub is not None: 2025-05-07T20:33:00.2035972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2036310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2036689Z ) 2025-05-07T20:33:00.2036880Z else: 2025-05-07T20:33:00.2037094Z scale_ub_tensor = None 2025-05-07T20:33:00.2037345Z 2025-05-07T20:33:00.2037570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:33:00.2037886Z op = silu_mul_quant 2025-05-07T20:33:00.2038134Z if compiled: 2025-05-07T20:33:00.2038375Z op = torch.compile(op) 2025-05-07T20:33:00.2038677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2038950Z 2025-05-07T20:33:00.2039135Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.2039424Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.2039733Z 2025-05-07T20:33:00.2040004Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2040334Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.2040625Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.2040941Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.2041364Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.2041679Z 2025-05-07T20:33:00.2041878Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.2042076Z 2025-05-07T20:33:00.2042176Z moe/activation_test.py:126: 2025-05-07T20:33:00.2042471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2042807Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.2043135Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.2043921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.2044686Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.2045232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2045915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2046603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.2047319Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.2048067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:00.2048806Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.2049535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.2050229Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.2050827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.2051336Z fn() 2025-05-07T20:33:00.2051840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.2052423Z self.fn.run( 2025-05-07T20:33:00.2052887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2053418Z kernel = self.compile( 2025-05-07T20:33:00.2053955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2054655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2055046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2055314Z 2025-05-07T20:33:00.2055520Z self = 2025-05-07T20:33:00.2056613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:00.2058060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919050a60>}
2025-05-07T20:33:00.2059401Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.2060484Z context =
2025-05-07T20:33:00.2060776Z
2025-05-07T20:33:00.2060946Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.2061475Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:00.2062309Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.2062712Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.2062973Z E       ^
2025-05-07T20:33:00.2063438Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.2063900Z
2025-05-07T20:33:00.2064318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.2064837Z
2025-05-07T20:33:00.2064947Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[Test source and traceback identical to the T=128 example above, elided: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises the same CompilationError at compiler.py:100.]
2025-05-07T20:33:01.0448283Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:01.0881839Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:01.0883085Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:01.0884424Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:01.0885424Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:01.0886535Z W0507 20:33:01.086616 88371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
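The recompile-limit warning above comes from the Hypothesis parameter sweep rather than from the FP8 failure itself: each (T, D, contiguous) combination changes the strides of x0/x1 (a non-contiguous slice of x keeps row stride 2*D = 10240, while a .contiguous() copy has row stride D = 5120), torch.compile guards on strides, and the eighth distinct guard set trips config.recompile_limit (8), after which Dynamo falls back to eager. A minimal sketch of how a sweep like this could sidestep the limit (assuming the torch._dynamo knobs present in this PyTorch build; none of this is in the test file, and the import path is inferred from the traceback):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the limit so every stride pattern in the sweep fits.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile once with dynamic shapes so one graph covers all sizes.
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)

    # Option 3: reset Dynamo's caches between examples so each run starts fresh.
    torch._dynamo.reset()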
2025-05-07T20:33:01.2090462Z self = 2025-05-07T20:33:01.2091012Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2091289Z 2025-05-07T20:33:01.2091366Z @given( 2025-05-07T20:33:01.2091602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2091918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2092320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2092657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2092986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2093337Z ) 2025-05-07T20:33:01.2093687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2094129Z def test_silu_mul_quant( 2025-05-07T20:33:01.2094372Z self, 2025-05-07T20:33:01.2094563Z T: int, 2025-05-07T20:33:01.2094762Z D: int, 2025-05-07T20:33:01.2095044Z scale_ub: Optional[float], 2025-05-07T20:33:01.2095311Z contiguous: bool, 2025-05-07T20:33:01.2095549Z compiled: bool, 2025-05-07T20:33:01.2095773Z ) -> None: 2025-05-07T20:33:01.2095981Z torch.manual_seed(2025) 2025-05-07T20:33:01.2096222Z 2025-05-07T20:33:01.2096495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2096834Z 2025-05-07T20:33:01.2097025Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2097320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2097629Z x = x_sign * x_clamp 2025-05-07T20:33:01.2097861Z x0 = x[:, :D] 2025-05-07T20:33:01.2098087Z x1 = x[:, D:] 2025-05-07T20:33:01.2098294Z 2025-05-07T20:33:01.2098470Z if contiguous: 2025-05-07T20:33:01.2098704Z x0 = x0.contiguous() 2025-05-07T20:33:01.2098961Z x1 = x1.contiguous() 2025-05-07T20:33:01.2099195Z 2025-05-07T20:33:01.2099386Z if scale_ub is not None: 2025-05-07T20:33:01.2099729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2100061Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2100399Z ) 2025-05-07T20:33:01.2100615Z else: 2025-05-07T20:33:01.2100822Z scale_ub_tensor = None 2025-05-07T20:33:01.2101079Z 2025-05-07T20:33:01.2101316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2101627Z op = silu_mul_quant 2025-05-07T20:33:01.2101877Z if compiled: 2025-05-07T20:33:01.2102135Z op = torch.compile(op) 2025-05-07T20:33:01.2102430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2102712Z 2025-05-07T20:33:01.2102909Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.2103195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.2103482Z 2025-05-07T20:33:01.2103885Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2104233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.2104522Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.2104840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.2105207Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.2105515Z 2025-05-07T20:33:01.2105720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.2105916Z 2025-05-07T20:33:01.2106030Z moe/activation_test.py:126: 2025-05-07T20:33:01.2106336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2106669Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.2107007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.2107814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.2108576Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.2109128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2109877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2110565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.2111356Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.2112108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.2112913Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.2113652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.2114286Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.2114971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.2115493Z fn() 2025-05-07T20:33:01.2115996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.2116575Z self.fn.run( 2025-05-07T20:33:01.2117051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2117581Z kernel = self.compile( 2025-05-07T20:33:01.2118119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2118779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2119177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2119411Z 2025-05-07T20:33:01.2119615Z self = 2025-05-07T20:33:01.2120833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2122219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918befc10>} 2025-05-07T20:33:01.2123603Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2124634Z context = 2025-05-07T20:33:01.2124919Z 2025-05-07T20:33:01.2125086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2125620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2126096Z module_map=module_map) 2025-05-07T20:33:01.2126462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2126820Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.2127086Z E ^ 2025-05-07T20:33:01.2127548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2128003Z 2025-05-07T20:33:01.2128425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2135808Z 2025-05-07T20:33:01.2135951Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2136385Z self=, 2025-05-07T20:33:01.2136804Z T=1, 2025-05-07T20:33:01.2136998Z D=5120, 2025-05-07T20:33:01.2137193Z scale_ub=1200.0, 2025-05-07T20:33:01.2137437Z contiguous=True, 2025-05-07T20:33:01.2137668Z compiled=True, 2025-05-07T20:33:01.2137877Z ) 2025-05-07T20:33:01.3837391Z self = 2025-05-07T20:33:01.3837898Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.3838198Z 2025-05-07T20:33:01.3838277Z @given( 2025-05-07T20:33:01.3838647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.3838956Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.3839269Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.3839603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.3839994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.3840336Z ) 2025-05-07T20:33:01.3840694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.3841143Z def test_silu_mul_quant( 2025-05-07T20:33:01.3841379Z self, 2025-05-07T20:33:01.3841644Z T: int, 2025-05-07T20:33:01.3841847Z D: int, 2025-05-07T20:33:01.3842066Z scale_ub: Optional[float], 2025-05-07T20:33:01.3842345Z contiguous: bool, 2025-05-07T20:33:01.3842585Z compiled: bool, 2025-05-07T20:33:01.3842802Z ) -> None: 2025-05-07T20:33:01.3843018Z torch.manual_seed(2025) 2025-05-07T20:33:01.3843270Z 2025-05-07T20:33:01.3843543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.3843892Z 2025-05-07T20:33:01.3844085Z x_sign = torch.sign(x) 2025-05-07T20:33:01.3844372Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.3844689Z x = x_sign * x_clamp 2025-05-07T20:33:01.3844937Z x0 = x[:, :D] 2025-05-07T20:33:01.3845145Z x1 = x[:, D:] 2025-05-07T20:33:01.3845354Z 2025-05-07T20:33:01.3845545Z if contiguous: 2025-05-07T20:33:01.3845778Z x0 = x0.contiguous() 2025-05-07T20:33:01.3846104Z x1 = x1.contiguous() 2025-05-07T20:33:01.3846352Z 2025-05-07T20:33:01.3846546Z if scale_ub is not None: 2025-05-07T20:33:01.3846816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.3847152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.3847457Z ) 2025-05-07T20:33:01.3847649Z else: 2025-05-07T20:33:01.3847869Z scale_ub_tensor = None 2025-05-07T20:33:01.3848124Z 2025-05-07T20:33:01.3848354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.3848673Z op = silu_mul_quant 2025-05-07T20:33:01.3848923Z if compiled: 2025-05-07T20:33:01.3849174Z op = torch.compile(op) 2025-05-07T20:33:01.3849475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.3849752Z 2025-05-07T20:33:01.3849939Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.3850114Z 2025-05-07T20:33:01.3850213Z moe/activation_test.py:117: 2025-05-07T20:33:01.3850567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.3850902Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.3851182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.3851747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.3852312Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.3852978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.3853660Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.3854197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.3854888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.3855553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.3856082Z kernel = self.compile( 2025-05-07T20:33:01.3856625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.3857278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.3857668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.3857950Z 2025-05-07T20:33:01.3858155Z self = 2025-05-07T20:33:01.3859281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.3860668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918457670>} 2025-05-07T20:33:01.3862060Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.3863081Z context = 2025-05-07T20:33:01.3863374Z 2025-05-07T20:33:01.3863540Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.3864067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.3864535Z module_map=module_map) 2025-05-07T20:33:01.3864899Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.3865254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.3865510Z E ^ 2025-05-07T20:33:01.3865975Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.3866481Z 2025-05-07T20:33:01.3866898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.3867412Z 2025-05-07T20:33:01.3867515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.3867931Z self=, 2025-05-07T20:33:01.3868341Z T=1, 2025-05-07T20:33:01.3868523Z D=5120, 2025-05-07T20:33:01.3868713Z scale_ub=None, 2025-05-07T20:33:01.3868924Z contiguous=False, 2025-05-07T20:33:01.3869157Z compiled=True, 2025-05-07T20:33:01.3869363Z ) 2025-05-07T20:33:01.4675961Z self = 2025-05-07T20:33:01.4676501Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.4676768Z 2025-05-07T20:33:01.4676859Z @given( 2025-05-07T20:33:01.4677092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.4677423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.4677739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.4678072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.4678411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.4678706Z ) 2025-05-07T20:33:01.4679068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.4679512Z def test_silu_mul_quant( 2025-05-07T20:33:01.4679764Z self, 2025-05-07T20:33:01.4679968Z T: int, 2025-05-07T20:33:01.4680187Z D: int, 2025-05-07T20:33:01.4680436Z scale_ub: Optional[float], 2025-05-07T20:33:01.4680720Z contiguous: bool, 2025-05-07T20:33:01.4680958Z compiled: bool, 2025-05-07T20:33:01.4681187Z ) -> None: 2025-05-07T20:33:01.4681410Z torch.manual_seed(2025) 2025-05-07T20:33:01.4681648Z 2025-05-07T20:33:01.4681925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.4682279Z 2025-05-07T20:33:01.4682472Z x_sign = torch.sign(x) 2025-05-07T20:33:01.4682768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.4683083Z x = x_sign * x_clamp 2025-05-07T20:33:01.4683322Z x0 = x[:, :D] 2025-05-07T20:33:01.4683546Z x1 = x[:, D:] 2025-05-07T20:33:01.4683764Z 2025-05-07T20:33:01.4684047Z if contiguous: 2025-05-07T20:33:01.4684285Z x0 = x0.contiguous() 2025-05-07T20:33:01.4684549Z x1 = x1.contiguous() 2025-05-07T20:33:01.4684795Z 2025-05-07T20:33:01.4684987Z if scale_ub is not None: 2025-05-07T20:33:01.4685334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.4685687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.4685994Z ) 2025-05-07T20:33:01.4686196Z else: 2025-05-07T20:33:01.4686416Z scale_ub_tensor = None 2025-05-07T20:33:01.4686670Z 2025-05-07T20:33:01.4686973Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.4687299Z op = silu_mul_quant 2025-05-07T20:33:01.4687555Z if compiled: 2025-05-07T20:33:01.4687812Z op = torch.compile(op) 2025-05-07T20:33:01.4688124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.4688403Z 2025-05-07T20:33:01.4688607Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.4688911Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.4689216Z 2025-05-07T20:33:01.4689454Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.4689802Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.4690108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.4690465Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.4690850Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.4691174Z 2025-05-07T20:33:01.4691450Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.4691659Z 2025-05-07T20:33:01.4691765Z moe/activation_test.py:126: 2025-05-07T20:33:01.4692070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.4692414Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.4692745Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.4693545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.4694330Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.4694886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.4695586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.4696288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.4697021Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.4697771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:01.4698527Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.4699266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.4699911Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.4700518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.4701040Z fn() 2025-05-07T20:33:01.4701551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.4702135Z self.fn.run( 2025-05-07T20:33:01.4702617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.4703155Z kernel = self.compile( 2025-05-07T20:33:01.4703852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.4704509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.4704982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.4705212Z 2025-05-07T20:33:01.4705424Z self = 2025-05-07T20:33:01.4706571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.4707973Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f89184b79d0>} 2025-05-07T20:33:01.4709411Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.4710550Z context = 2025-05-07T20:33:01.4710840Z 2025-05-07T20:33:01.4711013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.4711531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.4712014Z module_map=module_map) 2025-05-07T20:33:01.4712382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.4712737Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.4713010Z E ^ 2025-05-07T20:33:01.4713552Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.4714028Z 2025-05-07T20:33:01.4714445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.4714972Z 2025-05-07T20:33:01.4715078Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.4715504Z self=, 2025-05-07T20:33:01.4715919Z T=1, 2025-05-07T20:33:01.4716109Z D=5120, 2025-05-07T20:33:01.4716313Z scale_ub=None, 2025-05-07T20:33:01.4716526Z contiguous=True, 2025-05-07T20:33:01.4716769Z compiled=False, 2025-05-07T20:33:01.4716983Z ) 2025-05-07T20:33:01.8254142Z self = 2025-05-07T20:33:01.8255437Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.8255965Z 2025-05-07T20:33:01.8256124Z @given( 2025-05-07T20:33:01.8256601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8257225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8257833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8258480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8259135Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8259700Z ) 2025-05-07T20:33:01.8260387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8260888Z def test_silu_mul_quant( 2025-05-07T20:33:01.8261130Z self, 2025-05-07T20:33:01.8261327Z T: int, 2025-05-07T20:33:01.8261528Z D: int, 2025-05-07T20:33:01.8261743Z scale_ub: Optional[float], 2025-05-07T20:33:01.8262009Z contiguous: bool, 2025-05-07T20:33:01.8262246Z compiled: bool, 2025-05-07T20:33:01.8262474Z ) -> None: 2025-05-07T20:33:01.8262687Z torch.manual_seed(2025) 2025-05-07T20:33:01.8262933Z 2025-05-07T20:33:01.8263211Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8263563Z 2025-05-07T20:33:01.8263754Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8264048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8264359Z x = x_sign * x_clamp 2025-05-07T20:33:01.8264590Z x0 = x[:, :D] 2025-05-07T20:33:01.8264923Z x1 = x[:, D:] 2025-05-07T20:33:01.8265130Z 2025-05-07T20:33:01.8265310Z if contiguous: 2025-05-07T20:33:01.8265542Z x0 = x0.contiguous() 2025-05-07T20:33:01.8265804Z x1 = x1.contiguous() 2025-05-07T20:33:01.8266047Z 2025-05-07T20:33:01.8266306Z if scale_ub is not None: 2025-05-07T20:33:01.8266586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8266923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8267236Z ) 2025-05-07T20:33:01.8267437Z else: 2025-05-07T20:33:01.8267712Z scale_ub_tensor = None 2025-05-07T20:33:01.8267970Z 2025-05-07T20:33:01.8268212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8268529Z op = silu_mul_quant 2025-05-07T20:33:01.8268777Z if compiled: 2025-05-07T20:33:01.8269028Z op 
= torch.compile(op) 2025-05-07T20:33:01.8269327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8269601Z 2025-05-07T20:33:01.8269879Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8270062Z 2025-05-07T20:33:01.8270163Z moe/activation_test.py:117: 2025-05-07T20:33:01.8270467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8270799Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8271089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8271781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8272553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8273099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8273783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8274443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8274975Z kernel = self.compile( 2025-05-07T20:33:01.8275521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8276175Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8276565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8276796Z 2025-05-07T20:33:01.8277002Z self = 2025-05-07T20:33:01.8278096Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8279484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918480940>} 2025-05-07T20:33:01.8280879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8281908Z context = 2025-05-07T20:33:01.8282198Z 2025-05-07T20:33:01.8282370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8282894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8283364Z module_map=module_map) 2025-05-07T20:33:01.8283730Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8284082Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8284334Z E ^ 2025-05-07T20:33:01.8284808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8285314Z 2025-05-07T20:33:01.8285730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8286243Z 2025-05-07T20:33:01.8286389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8286804Z self=, 2025-05-07T20:33:01.8287213Z T=128, 2025-05-07T20:33:01.8287400Z D=5120, 2025-05-07T20:33:01.8287586Z scale_ub=None, 2025-05-07T20:33:01.8287803Z contiguous=False, 2025-05-07T20:33:01.8288069Z compiled=True, 2025-05-07T20:33:01.8288275Z ) 2025-05-07T20:33:01.8288592Z self = 2025-05-07T20:33:01.8289084Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.8289352Z 2025-05-07T20:33:01.8289435Z @given( 2025-05-07T20:33:01.8289658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8289980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8290294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8290646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8291009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8291295Z ) 2025-05-07T20:33:01.8291640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8292073Z def test_silu_mul_quant( 2025-05-07T20:33:01.8292316Z self, 2025-05-07T20:33:01.8292514Z T: int, 2025-05-07T20:33:01.8292753Z D: int, 2025-05-07T20:33:01.8292976Z scale_ub: Optional[float], 2025-05-07T20:33:01.8293251Z contiguous: bool, 2025-05-07T20:33:01.8293482Z compiled: bool, 2025-05-07T20:33:01.8293702Z ) -> None: 2025-05-07T20:33:01.8293917Z torch.manual_seed(2025) 2025-05-07T20:33:01.8294156Z 2025-05-07T20:33:01.8294426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8294771Z 2025-05-07T20:33:01.8294960Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8295251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8295566Z x = x_sign * x_clamp 2025-05-07T20:33:01.8295802Z x0 = x[:, :D] 2025-05-07T20:33:01.8296020Z x1 = x[:, D:] 2025-05-07T20:33:01.8296226Z 2025-05-07T20:33:01.8296402Z if contiguous: 2025-05-07T20:33:01.8296631Z x0 = x0.contiguous() 2025-05-07T20:33:01.8296889Z x1 = x1.contiguous() 2025-05-07T20:33:01.8297136Z 2025-05-07T20:33:01.8297328Z if scale_ub is not None: 2025-05-07T20:33:01.8297598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8297936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8298238Z ) 2025-05-07T20:33:01.8298431Z else: 2025-05-07T20:33:01.8298637Z scale_ub_tensor = None 2025-05-07T20:33:01.8298886Z 2025-05-07T20:33:01.8299119Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8299433Z op = silu_mul_quant 2025-05-07T20:33:01.8299677Z if compiled: 2025-05-07T20:33:01.8299924Z op = torch.compile(op) 2025-05-07T20:33:01.8300251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8300544Z 2025-05-07T20:33:01.8300735Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8300897Z 2025-05-07T20:33:01.8301000Z moe/activation_test.py:117: 2025-05-07T20:33:01.8301294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8301628Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8301918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8302475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.8303029Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.8303686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8304705Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8305329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8306009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8306672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8307203Z kernel = self.compile( 2025-05-07T20:33:01.8307805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8308460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8308860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8309087Z 2025-05-07T20:33:01.8309315Z self = 2025-05-07T20:33:01.8310519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8312084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbd040>} 2025-05-07T20:33:01.8313651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8314676Z context = 2025-05-07T20:33:01.8314958Z 2025-05-07T20:33:01.8315129Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8315652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8316122Z module_map=module_map) 2025-05-07T20:33:01.8316483Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8316827Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8317078Z E ^ 2025-05-07T20:33:01.8317545Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8317992Z 2025-05-07T20:33:01.8318417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8318931Z 2025-05-07T20:33:01.8319033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8319443Z self=, 2025-05-07T20:33:01.8319849Z T=128, 2025-05-07T20:33:01.8320031Z D=7168, 2025-05-07T20:33:01.8320216Z scale_ub=1200.0, 2025-05-07T20:33:01.8320443Z contiguous=False, 2025-05-07T20:33:01.8320699Z compiled=False, 2025-05-07T20:33:01.8320908Z ) 2025-05-07T20:33:01.9848381Z self = 2025-05-07T20:33:01.9848953Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.9849231Z 2025-05-07T20:33:01.9849310Z @given( 2025-05-07T20:33:01.9849614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9850026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9850340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9850674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9856969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9857297Z ) 2025-05-07T20:33:01.9857651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9858215Z def test_silu_mul_quant( 2025-05-07T20:33:01.9858460Z self, 2025-05-07T20:33:01.9858653Z T: int, 2025-05-07T20:33:01.9858845Z D: int, 2025-05-07T20:33:01.9859066Z scale_ub: Optional[float], 2025-05-07T20:33:01.9859336Z contiguous: bool, 2025-05-07T20:33:01.9859643Z compiled: bool, 2025-05-07T20:33:01.9859869Z ) -> None: 2025-05-07T20:33:01.9860078Z torch.manual_seed(2025) 2025-05-07T20:33:01.9860329Z 2025-05-07T20:33:01.9860611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9860957Z 2025-05-07T20:33:01.9861219Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9861512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9861824Z x = x_sign * x_clamp 2025-05-07T20:33:01.9862065Z x0 = x[:, :D] 2025-05-07T20:33:01.9862278Z x1 = x[:, D:] 2025-05-07T20:33:01.9862484Z 2025-05-07T20:33:01.9862664Z if contiguous: 2025-05-07T20:33:01.9862903Z x0 = x0.contiguous() 2025-05-07T20:33:01.9863162Z x1 = x1.contiguous() 2025-05-07T20:33:01.9863398Z 2025-05-07T20:33:01.9863592Z if scale_ub is not None: 2025-05-07T20:33:01.9863874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9864209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9864520Z ) 2025-05-07T20:33:01.9864710Z else: 2025-05-07T20:33:01.9864913Z scale_ub_tensor = None 2025-05-07T20:33:01.9865164Z 2025-05-07T20:33:01.9865396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9865791Z op = silu_mul_quant 2025-05-07T20:33:01.9866037Z if compiled: 2025-05-07T20:33:01.9866290Z op = torch.compile(op) 2025-05-07T20:33:01.9866591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9866862Z 2025-05-07T20:33:01.9867055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9867221Z 2025-05-07T20:33:01.9867327Z moe/activation_test.py:117: 2025-05-07T20:33:01.9867616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9867951Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9868228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9868917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9869619Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9870236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9870924Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9871589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9872121Z kernel = self.compile( 2025-05-07T20:33:01.9872668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9873323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9873716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9873947Z 2025-05-07T20:33:01.9874151Z self = 2025-05-07T20:33:01.9875236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9876619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbdd30>} 2025-05-07T20:33:01.9877991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9879070Z context = 2025-05-07T20:33:01.9879354Z 2025-05-07T20:33:01.9879556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9880080Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9880595Z module_map=module_map) 2025-05-07T20:33:01.9880962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9881351Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9881616Z E ^ 2025-05-07T20:33:01.9882088Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9882540Z 2025-05-07T20:33:01.9882959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9883479Z 2025-05-07T20:33:01.9883580Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9883994Z self=, 2025-05-07T20:33:01.9884410Z T=128, 2025-05-07T20:33:01.9884594Z D=5120, 2025-05-07T20:33:01.9884782Z scale_ub=None, 2025-05-07T20:33:01.9884997Z contiguous=False, 2025-05-07T20:33:01.9885216Z compiled=False, 2025-05-07T20:33:01.9885423Z ) 2025-05-07T20:33:01.9885745Z self = 2025-05-07T20:33:01.9886280Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.9886555Z 2025-05-07T20:33:01.9886634Z @given( 2025-05-07T20:33:01.9886863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9887171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9887476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9887812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9888142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9888419Z ) 2025-05-07T20:33:01.9888766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9889208Z def test_silu_mul_quant( 2025-05-07T20:33:01.9889441Z self, 2025-05-07T20:33:01.9889637Z T: int, 2025-05-07T20:33:01.9889830Z D: int, 2025-05-07T20:33:01.9890040Z scale_ub: Optional[float], 2025-05-07T20:33:01.9890315Z contiguous: bool, 2025-05-07T20:33:01.9890598Z compiled: bool, 2025-05-07T20:33:01.9890830Z ) -> None: 2025-05-07T20:33:01.9891041Z torch.manual_seed(2025) 2025-05-07T20:33:01.9891279Z 2025-05-07T20:33:01.9891541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9891878Z 2025-05-07T20:33:01.9892067Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9892356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9892655Z x = x_sign * x_clamp 2025-05-07T20:33:01.9892893Z x0 = x[:, :D] 2025-05-07T20:33:01.9893111Z x1 = x[:, D:] 2025-05-07T20:33:01.9893311Z 2025-05-07T20:33:01.9893497Z if contiguous: 2025-05-07T20:33:01.9893727Z x0 = x0.contiguous() 2025-05-07T20:33:01.9893977Z x1 = x1.contiguous() 2025-05-07T20:33:01.9894215Z 2025-05-07T20:33:01.9894403Z if scale_ub is not None: 2025-05-07T20:33:01.9894665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9895007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9895313Z ) 2025-05-07T20:33:01.9895497Z else: 2025-05-07T20:33:01.9895706Z scale_ub_tensor = None 2025-05-07T20:33:01.9895958Z 2025-05-07T20:33:01.9896181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9896491Z op = silu_mul_quant 2025-05-07T20:33:01.9896792Z if compiled: 2025-05-07T20:33:01.9897040Z op = torch.compile(op) 2025-05-07T20:33:01.9897329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9897602Z 2025-05-07T20:33:01.9897829Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9897999Z 2025-05-07T20:33:01.9898098Z moe/activation_test.py:117: 2025-05-07T20:33:01.9898394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9898720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9898993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9899725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9900419Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9900957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9901638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9902294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9902825Z kernel = self.compile( 2025-05-07T20:33:01.9903361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9904607Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9905002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9905307Z 2025-05-07T20:33:01.9905515Z self = 2025-05-07T20:33:01.9906597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9907982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c82310>} 2025-05-07T20:33:01.9909338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9910456Z context = 2025-05-07T20:33:01.9910742Z 2025-05-07T20:33:01.9910916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9911433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9911896Z module_map=module_map) 2025-05-07T20:33:01.9912266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9912616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9912873Z E ^ 2025-05-07T20:33:01.9913340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9913792Z 2025-05-07T20:33:01.9914216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9914728Z 2025-05-07T20:33:01.9914832Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9915247Z self=, 2025-05-07T20:33:01.9915662Z T=128, 2025-05-07T20:33:01.9915841Z D=5120, 2025-05-07T20:33:01.9916029Z scale_ub=1200.0, 2025-05-07T20:33:01.9916248Z contiguous=True, 2025-05-07T20:33:01.9916460Z compiled=False, 2025-05-07T20:33:01.9916670Z ) 2025-05-07T20:33:02.2191202Z self = 2025-05-07T20:33:02.2191731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:02.2192125Z 2025-05-07T20:33:02.2192203Z @given( 2025-05-07T20:33:02.2192499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2192889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2193272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2193607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2193937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2194225Z ) 2025-05-07T20:33:02.2194579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2195112Z def test_silu_mul_quant( 2025-05-07T20:33:02.2195354Z self, 2025-05-07T20:33:02.2195548Z T: int, 2025-05-07T20:33:02.2195744Z D: int, 2025-05-07T20:33:02.2195963Z scale_ub: Optional[float], 2025-05-07T20:33:02.2196235Z contiguous: bool, 2025-05-07T20:33:02.2196475Z compiled: bool, 2025-05-07T20:33:02.2196706Z ) -> None: 2025-05-07T20:33:02.2196918Z torch.manual_seed(2025) 2025-05-07T20:33:02.2197162Z 2025-05-07T20:33:02.2197432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2197772Z 2025-05-07T20:33:02.2197967Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2198285Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2198596Z x = x_sign * x_clamp 2025-05-07T20:33:02.2198841Z x0 = x[:, :D] 2025-05-07T20:33:02.2199056Z x1 = x[:, D:] 2025-05-07T20:33:02.2199263Z 2025-05-07T20:33:02.2199519Z if contiguous: 2025-05-07T20:33:02.2199747Z x0 = x0.contiguous() 2025-05-07T20:33:02.2200008Z x1 = x1.contiguous() 2025-05-07T20:33:02.2200249Z 2025-05-07T20:33:02.2200435Z if scale_ub is not None: 2025-05-07T20:33:02.2200707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2201050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2201367Z ) 2025-05-07T20:33:02.2201563Z else: 2025-05-07T20:33:02.2201771Z scale_ub_tensor = None 2025-05-07T20:33:02.2202023Z 2025-05-07T20:33:02.2202259Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2202574Z op = silu_mul_quant 2025-05-07T20:33:02.2202823Z if compiled: 2025-05-07T20:33:02.2203078Z op = torch.compile(op) 2025-05-07T20:33:02.2203371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2203646Z 2025-05-07T20:33:02.2204021Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2204192Z 2025-05-07T20:33:02.2204295Z moe/activation_test.py:117: 2025-05-07T20:33:02.2204597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2204931Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2205210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2205903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2206598Z 
Hypothesis then tried further examples, each failing at fn() with the same CompilationError from _fbgemm_silu_mul_quant:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
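The sweep shows the failure is independent of T, D, scale_ub, contiguity, and torch.compile: the error is raised while lowering the kernel, before any data is touched. A standalone probe in the same spirit (hypothetical, assuming a recent Triton with tl.float8e4nv and a PyTorch with torch.float8_e4m3fn) should reproduce the identical ValueError on this GPU:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _fp8e4nv_probe(y_ptr, BLOCK: tl.constexpr):
        x = tl.zeros([BLOCK], dtype=tl.float32)
        # This cast is what trips the architecture check during make_ir.
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + tl.arange(0, BLOCK), y)


    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on SM 8.6: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _fp8e4nv_probe[(1,)](y, BLOCK=16)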
Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example fn() itself returned, and the failure moved to the reference path instead, where triton_quantize_fp8_row compiles the _kernel_quantize_fp8_row Triton kernel:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
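The reference path fails the same way because triton_quantize_fp8_row launches a Triton kernel and hits the same architecture check through the autotuner. For context, a rough pure-PyTorch sketch of the row-wise quantization the reference performs, assuming torch.float8_e4m3fn is available; the helper name and the exact clamping details are illustrative, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row symmetric scaling; dequantization is
        # x_fp8.to(torch.float32) * scale[:, None], matching how the test
        # consumes (y_fp8, y_scale).
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale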
2025-05-07T20:33:02.8774001Z op = torch.compile(op) 2025-05-07T20:33:02.8774284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8774554Z 2025-05-07T20:33:02.8774738Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8774900Z 2025-05-07T20:33:02.8775001Z moe/activation_test.py:117: 2025-05-07T20:33:02.8775290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8775620Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8775896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8776442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.8776993Z return fn(*args, **kwargs) 2025-05-07T20:33:02.8777650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8778339Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8778864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8779539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8780191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8780772Z kernel = self.compile( 2025-05-07T20:33:02.8781311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8781966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8782356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8782587Z 2025-05-07T20:33:02.8782799Z self = 2025-05-07T20:33:02.8783874Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8785301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e44b80>} 2025-05-07T20:33:02.8786679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8787701Z context = 2025-05-07T20:33:02.8787980Z 2025-05-07T20:33:02.8788194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8788706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8789163Z module_map=module_map) 2025-05-07T20:33:02.8789525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8789946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8790205Z E ^ 2025-05-07T20:33:02.8790675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8791177Z 2025-05-07T20:33:02.8791597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8792104Z 2025-05-07T20:33:02.8792201Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8792607Z self=, 2025-05-07T20:33:02.8793009Z T=1, 2025-05-07T20:33:02.8793227Z D=5120, 2025-05-07T20:33:02.8793408Z scale_ub=1200.0, 2025-05-07T20:33:02.8793625Z contiguous=False, 2025-05-07T20:33:02.8793837Z compiled=False, 2025-05-07T20:33:02.8794034Z ) 2025-05-07T20:33:02.8794338Z self = 2025-05-07T20:33:02.8794817Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.8795083Z 2025-05-07T20:33:02.8795157Z @given( 2025-05-07T20:33:02.8795374Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.8795676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.8795974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.8796296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.8796612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.8796883Z ) 2025-05-07T20:33:02.8797227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.8797660Z def test_silu_mul_quant( 2025-05-07T20:33:02.8797891Z self, 2025-05-07T20:33:02.8798072Z T: int, 2025-05-07T20:33:02.8798256Z D: int, 2025-05-07T20:33:02.8798471Z scale_ub: Optional[float], 2025-05-07T20:33:02.8798726Z contiguous: bool, 2025-05-07T20:33:02.8798953Z compiled: bool, 2025-05-07T20:33:02.8799169Z ) -> None: 2025-05-07T20:33:02.8799374Z torch.manual_seed(2025) 2025-05-07T20:33:02.8799613Z 2025-05-07T20:33:02.8799877Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.8800211Z 2025-05-07T20:33:02.8800416Z x_sign = torch.sign(x) 2025-05-07T20:33:02.8800735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.8801030Z x = x_sign * x_clamp 2025-05-07T20:33:02.8801265Z x0 = x[:, :D] 2025-05-07T20:33:02.8801475Z x1 = x[:, D:] 2025-05-07T20:33:02.8801666Z 2025-05-07T20:33:02.8801851Z if contiguous: 2025-05-07T20:33:02.8802074Z x0 = x0.contiguous() 2025-05-07T20:33:02.8802320Z x1 = x1.contiguous() 2025-05-07T20:33:02.8802555Z 2025-05-07T20:33:02.8802733Z if scale_ub is not None: 2025-05-07T20:33:02.8803009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.8803329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.8803678Z ) 2025-05-07T20:33:02.8804047Z else: 2025-05-07T20:33:02.8804242Z scale_ub_tensor = None 2025-05-07T20:33:02.8804487Z 2025-05-07T20:33:02.8804710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.8805080Z op = silu_mul_quant 2025-05-07T20:33:02.8805330Z if compiled: 2025-05-07T20:33:02.8805566Z op = torch.compile(op) 2025-05-07T20:33:02.8805856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8806115Z 2025-05-07T20:33:02.8806294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.8806522Z 2025-05-07T20:33:02.8806622Z moe/activation_test.py:117: 2025-05-07T20:33:02.8806906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8807228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.8807501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.8808183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.8808869Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.8809406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.8810078Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.8810777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.8811303Z kernel = self.compile( 2025-05-07T20:33:02.8811904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.8812562Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.8812943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.8813169Z 2025-05-07T20:33:02.8813374Z self = 2025-05-07T20:33:02.8814450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.8815823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917927550>} 2025-05-07T20:33:02.8817161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.8818177Z context = 2025-05-07T20:33:02.8818461Z 2025-05-07T20:33:02.8818621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.8819143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.8819598Z module_map=module_map) 2025-05-07T20:33:02.8819954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.8820300Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.8820546Z E ^ 2025-05-07T20:33:02.8821053Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.8821512Z 2025-05-07T20:33:02.8821928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.8822439Z 2025-05-07T20:33:02.8822542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.8822947Z self=, 2025-05-07T20:33:02.8823345Z T=16384, 2025-05-07T20:33:02.8823528Z D=5120, 2025-05-07T20:33:02.8823775Z scale_ub=1200.0, 2025-05-07T20:33:02.8823990Z contiguous=False, 2025-05-07T20:33:02.8824213Z compiled=True, 2025-05-07T20:33:02.8824402Z ) 2025-05-07T20:33:03.0004240Z self = 2025-05-07T20:33:03.0005079Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.0005515Z 2025-05-07T20:33:03.0005623Z @given( 2025-05-07T20:33:03.0005931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0006363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0006855Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0007312Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0007645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0007925Z ) 2025-05-07T20:33:03.0008276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0008723Z def test_silu_mul_quant( 2025-05-07T20:33:03.0008964Z self, 2025-05-07T20:33:03.0009157Z T: int, 2025-05-07T20:33:03.0009348Z D: int, 2025-05-07T20:33:03.0009566Z scale_ub: Optional[float], 2025-05-07T20:33:03.0009833Z contiguous: bool, 2025-05-07T20:33:03.0010079Z compiled: bool, 2025-05-07T20:33:03.0010307Z ) -> None: 2025-05-07T20:33:03.0010543Z torch.manual_seed(2025) 2025-05-07T20:33:03.0010812Z 2025-05-07T20:33:03.0011084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0011426Z 2025-05-07T20:33:03.0011705Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0012002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0012308Z x = x_sign * x_clamp 2025-05-07T20:33:03.0012546Z x0 = x[:, :D] 2025-05-07T20:33:03.0012764Z x1 = x[:, D:] 2025-05-07T20:33:03.0012967Z 2025-05-07T20:33:03.0013152Z if contiguous: 2025-05-07T20:33:03.0013385Z x0 = x0.contiguous() 2025-05-07T20:33:03.0013642Z x1 = x1.contiguous() 2025-05-07T20:33:03.0013886Z 2025-05-07T20:33:03.0014080Z if scale_ub is not None: 2025-05-07T20:33:03.0014354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0014688Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0014997Z ) 2025-05-07T20:33:03.0015186Z else: 2025-05-07T20:33:03.0015393Z scale_ub_tensor = None 2025-05-07T20:33:03.0015645Z 2025-05-07T20:33:03.0015875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0016192Z op = silu_mul_quant 2025-05-07T20:33:03.0016444Z if compiled: 2025-05-07T20:33:03.0016691Z op = torch.compile(op) 2025-05-07T20:33:03.0016985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0017260Z 2025-05-07T20:33:03.0017452Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0017620Z 2025-05-07T20:33:03.0017721Z moe/activation_test.py:117: 2025-05-07T20:33:03.0018017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0018347Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0018628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0019187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.0019748Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.0020407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0021146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0021686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0022366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0023021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0023616Z kernel = self.compile( 2025-05-07T20:33:03.0024152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0024839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0025235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0025461Z 2025-05-07T20:33:03.0025668Z self = 2025-05-07T20:33:03.0026820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0028200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89180401f0>} 2025-05-07T20:33:03.0029548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0030643Z context = 2025-05-07T20:33:03.0030932Z 2025-05-07T20:33:03.0031095Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0031665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0032141Z module_map=module_map) 2025-05-07T20:33:03.0032506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0032860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0033121Z E ^ 2025-05-07T20:33:03.0033612Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0034068Z 2025-05-07T20:33:03.0034492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0040882Z 2025-05-07T20:33:03.0041014Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0041438Z self=, 2025-05-07T20:33:03.0041835Z T=2048, 2025-05-07T20:33:03.0042028Z D=7168, 2025-05-07T20:33:03.0042217Z scale_ub=1200.0, 2025-05-07T20:33:03.0042437Z contiguous=False, 2025-05-07T20:33:03.0042660Z compiled=True, 2025-05-07T20:33:03.0042858Z ) 2025-05-07T20:33:03.0043174Z self = 2025-05-07T20:33:03.0043665Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.0043940Z 2025-05-07T20:33:03.0044018Z @given( 2025-05-07T20:33:03.0044247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0044555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0044862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0045189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0045513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0045800Z ) 2025-05-07T20:33:03.0046151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0046583Z def test_silu_mul_quant( 2025-05-07T20:33:03.0046821Z self, 2025-05-07T20:33:03.0047019Z T: int, 2025-05-07T20:33:03.0047209Z D: int, 2025-05-07T20:33:03.0047426Z scale_ub: Optional[float], 2025-05-07T20:33:03.0047693Z contiguous: bool, 2025-05-07T20:33:03.0047930Z compiled: bool, 2025-05-07T20:33:03.0048151Z ) -> None: 2025-05-07T20:33:03.0048363Z torch.manual_seed(2025) 2025-05-07T20:33:03.0048679Z 2025-05-07T20:33:03.0048944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0049286Z 2025-05-07T20:33:03.0049485Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0049776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0050129Z x = x_sign * x_clamp 2025-05-07T20:33:03.0050367Z x0 = x[:, :D] 2025-05-07T20:33:03.0050578Z x1 = x[:, D:] 2025-05-07T20:33:03.0050803Z 2025-05-07T20:33:03.0051012Z if contiguous: 2025-05-07T20:33:03.0051236Z x0 = x0.contiguous() 2025-05-07T20:33:03.0051495Z x1 = x1.contiguous() 2025-05-07T20:33:03.0051775Z 2025-05-07T20:33:03.0051957Z if scale_ub is not None: 2025-05-07T20:33:03.0052224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0052555Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0052857Z ) 2025-05-07T20:33:03.0053051Z else: 2025-05-07T20:33:03.0053259Z scale_ub_tensor = None 2025-05-07T20:33:03.0053505Z 2025-05-07T20:33:03.0053726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0054038Z op = silu_mul_quant 2025-05-07T20:33:03.0054284Z if compiled: 2025-05-07T20:33:03.0054529Z op = torch.compile(op) 2025-05-07T20:33:03.0054824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0055092Z 2025-05-07T20:33:03.0055274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0055437Z 2025-05-07T20:33:03.0055535Z moe/activation_test.py:117: 2025-05-07T20:33:03.0055871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0056196Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0056471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0057023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.0057573Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.0058231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0058913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0059447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0060120Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0060825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0061357Z kernel = self.compile( 2025-05-07T20:33:03.0061891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0062548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0062934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0063165Z 2025-05-07T20:33:03.0063369Z self = 2025-05-07T20:33:03.0064462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0065849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8918040ee0>} 2025-05-07T20:33:03.0067202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0068235Z context = 2025-05-07T20:33:03.0068525Z 2025-05-07T20:33:03.0068744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0069270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0069733Z module_map=module_map) 2025-05-07T20:33:03.0070211Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0070595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0070874Z E ^ 2025-05-07T20:33:03.0071345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0071841Z 2025-05-07T20:33:03.0072256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0072767Z 2025-05-07T20:33:03.2731105Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2731727Z self=, 2025-05-07T20:33:03.2732309Z T=1, 2025-05-07T20:33:03.2732553Z D=5120, 2025-05-07T20:33:03.2732805Z scale_ub=None, 2025-05-07T20:33:03.2733077Z contiguous=False, 2025-05-07T20:33:03.2733293Z compiled=False, 2025-05-07T20:33:03.2733490Z ) 2025-05-07T20:33:03.2733800Z self = 2025-05-07T20:33:03.2734283Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.2734546Z 2025-05-07T20:33:03.2734622Z @given( 2025-05-07T20:33:03.2734842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2735273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2735579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2735902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2736253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2736524Z ) 2025-05-07T20:33:03.2736863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2737301Z def test_silu_mul_quant( 2025-05-07T20:33:03.2737535Z self, 2025-05-07T20:33:03.2737713Z T: int, 2025-05-07T20:33:03.2737900Z D: int, 2025-05-07T20:33:03.2738111Z scale_ub: Optional[float], 2025-05-07T20:33:03.2738371Z contiguous: bool, 2025-05-07T20:33:03.2738604Z compiled: bool, 2025-05-07T20:33:03.2738818Z ) -> None: 2025-05-07T20:33:03.2739025Z torch.manual_seed(2025) 2025-05-07T20:33:03.2739263Z 2025-05-07T20:33:03.2739532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2739869Z 2025-05-07T20:33:03.2740055Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2740347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2740644Z x = x_sign * x_clamp 2025-05-07T20:33:03.2740916Z x0 = x[:, :D] 2025-05-07T20:33:03.2741131Z x1 = x[:, D:] 2025-05-07T20:33:03.2741326Z 2025-05-07T20:33:03.2741507Z if contiguous: 2025-05-07T20:33:03.2741732Z x0 = x0.contiguous() 2025-05-07T20:33:03.2741977Z x1 = x1.contiguous() 2025-05-07T20:33:03.2742210Z 2025-05-07T20:33:03.2742389Z if scale_ub is not None: 2025-05-07T20:33:03.2742656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2742983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2743286Z ) 2025-05-07T20:33:03.2743472Z else: 2025-05-07T20:33:03.2743669Z scale_ub_tensor = None 2025-05-07T20:33:03.2743911Z 2025-05-07T20:33:03.2744148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2744456Z op = silu_mul_quant 2025-05-07T20:33:03.2744707Z if compiled: 2025-05-07T20:33:03.2744947Z op = torch.compile(op) 2025-05-07T20:33:03.2745233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2745508Z 2025-05-07T20:33:03.2745692Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2745923Z 2025-05-07T20:33:03.2746017Z moe/activation_test.py:117: 2025-05-07T20:33:03.2746307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2746629Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2746962Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2747645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2748332Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2748870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2749602Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2750339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2750914Z kernel = self.compile( 2025-05-07T20:33:03.2751458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2752094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2752485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2752707Z 2025-05-07T20:33:03.2752914Z self = 2025-05-07T20:33:03.2754041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2755415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89178db5e0>} 2025-05-07T20:33:03.2756751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2757765Z context = 2025-05-07T20:33:03.2758047Z 2025-05-07T20:33:03.2758212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2758720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2759181Z module_map=module_map) 2025-05-07T20:33:03.2759543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2759890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2760136Z E ^ 2025-05-07T20:33:03.2760597Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2761046Z 2025-05-07T20:33:03.2761462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2761973Z 2025-05-07T20:33:03.2762073Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2762474Z self=, 2025-05-07T20:33:03.2762868Z T=4096, 2025-05-07T20:33:03.2763044Z D=7168, 2025-05-07T20:33:03.2763227Z scale_ub=1200.0, 2025-05-07T20:33:03.2763444Z contiguous=False, 2025-05-07T20:33:03.2763660Z compiled=False, 2025-05-07T20:33:03.2763849Z ) 2025-05-07T20:33:03.2764159Z self = 2025-05-07T20:33:03.2764648Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.2764919Z 2025-05-07T20:33:03.2764992Z @given( 2025-05-07T20:33:03.2765214Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.2765517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.2765866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.2766183Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.2766504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.2766782Z ) 2025-05-07T20:33:03.2767157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.2767587Z def test_silu_mul_quant( 2025-05-07T20:33:03.2767823Z self, 2025-05-07T20:33:03.2768002Z T: int, 2025-05-07T20:33:03.2768194Z D: int, 2025-05-07T20:33:03.2768408Z scale_ub: Optional[float], 2025-05-07T20:33:03.2768733Z contiguous: bool, 2025-05-07T20:33:03.2768969Z compiled: bool, 2025-05-07T20:33:03.2769186Z ) -> None: 2025-05-07T20:33:03.2769395Z torch.manual_seed(2025) 2025-05-07T20:33:03.2769626Z 2025-05-07T20:33:03.2769889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.2770227Z 2025-05-07T20:33:03.2770417Z x_sign = torch.sign(x) 2025-05-07T20:33:03.2770729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.2771043Z x = x_sign * x_clamp 2025-05-07T20:33:03.2771272Z x0 = x[:, :D] 2025-05-07T20:33:03.2771477Z x1 = x[:, D:] 2025-05-07T20:33:03.2771678Z 2025-05-07T20:33:03.2771852Z if contiguous: 2025-05-07T20:33:03.2772072Z x0 = x0.contiguous() 2025-05-07T20:33:03.2772327Z x1 = x1.contiguous() 2025-05-07T20:33:03.2772558Z 2025-05-07T20:33:03.2772736Z if scale_ub is not None: 2025-05-07T20:33:03.2773048Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.2773375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.2773668Z ) 2025-05-07T20:33:03.2773857Z else: 2025-05-07T20:33:03.2774061Z scale_ub_tensor = None 2025-05-07T20:33:03.2774297Z 2025-05-07T20:33:03.2774523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.2774828Z op = silu_mul_quant 2025-05-07T20:33:03.2775066Z if compiled: 2025-05-07T20:33:03.2775299Z op = torch.compile(op) 2025-05-07T20:33:03.2775595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2775862Z 2025-05-07T20:33:03.2776038Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.2776202Z 2025-05-07T20:33:03.2776298Z moe/activation_test.py:117: 2025-05-07T20:33:03.2776585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2776901Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.2777182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.2777869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.2778551Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.2779075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.2779748Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.2780398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.2780920Z kernel = self.compile( 2025-05-07T20:33:03.2781450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.2782092Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.2782480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.2782706Z 2025-05-07T20:33:03.2782905Z self = 2025-05-07T20:33:03.2783981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.2785402Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917f8b1f0>} 2025-05-07T20:33:03.2786788Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.2787813Z context = 2025-05-07T20:33:03.2788137Z 2025-05-07T20:33:03.2788302Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.2788817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.2789284Z module_map=module_map) 2025-05-07T20:33:03.2789643Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.2790036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.2790286Z E ^ 2025-05-07T20:33:03.2790741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.2791197Z 2025-05-07T20:33:03.2791614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.2792133Z 2025-05-07T20:33:03.2792234Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.2792679Z self=, 2025-05-07T20:33:03.2793083Z T=16384, 2025-05-07T20:33:03.2793264Z D=7168, 2025-05-07T20:33:03.2793449Z scale_ub=None, 2025-05-07T20:33:03.2793649Z contiguous=True, 2025-05-07T20:33:03.2793864Z compiled=True, 2025-05-07T20:33:03.2794060Z ) 2025-05-07T20:33:03.5626025Z self = 2025-05-07T20:33:03.5626802Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.5627179Z 2025-05-07T20:33:03.5627279Z @given( 2025-05-07T20:33:03.5627582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5627954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5628289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5628654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5629019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5629330Z ) 2025-05-07T20:33:03.5629690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5630189Z def test_silu_mul_quant( 2025-05-07T20:33:03.5630423Z self, 2025-05-07T20:33:03.5630624Z T: int, 2025-05-07T20:33:03.5630852Z D: int, 2025-05-07T20:33:03.5631071Z scale_ub: Optional[float], 2025-05-07T20:33:03.5631341Z contiguous: bool, 2025-05-07T20:33:03.5631580Z compiled: bool, 2025-05-07T20:33:03.5631797Z ) -> None: 2025-05-07T20:33:03.5632008Z torch.manual_seed(2025) 2025-05-07T20:33:03.5632244Z 2025-05-07T20:33:03.5632507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5632850Z 2025-05-07T20:33:03.5633037Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5633322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5633621Z x = x_sign * x_clamp 2025-05-07T20:33:03.5633856Z x0 = x[:, :D] 2025-05-07T20:33:03.5634064Z x1 = x[:, D:] 2025-05-07T20:33:03.5634270Z 2025-05-07T20:33:03.5634451Z if contiguous: 2025-05-07T20:33:03.5634675Z x0 = x0.contiguous() 2025-05-07T20:33:03.5634925Z x1 = x1.contiguous() 2025-05-07T20:33:03.5635162Z 2025-05-07T20:33:03.5635349Z if scale_ub is not None: 2025-05-07T20:33:03.5635615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5636085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5636395Z ) 2025-05-07T20:33:03.5636577Z else: 2025-05-07T20:33:03.5636789Z scale_ub_tensor = None 2025-05-07T20:33:03.5637039Z 2025-05-07T20:33:03.5637332Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5637645Z op = silu_mul_quant 2025-05-07T20:33:03.5637901Z if compiled: 2025-05-07T20:33:03.5638155Z op = torch.compile(op) 2025-05-07T20:33:03.5638448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5638789Z 2025-05-07T20:33:03.5638981Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5639149Z 2025-05-07T20:33:03.5639248Z moe/activation_test.py:117: 2025-05-07T20:33:03.5639539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5639873Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5640151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5640704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5641303Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5641969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5642643Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5643174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5643912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5644563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5645091Z kernel = self.compile( 2025-05-07T20:33:03.5645631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5646285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5646666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5646899Z 2025-05-07T20:33:03.5647105Z self = 2025-05-07T20:33:03.5648177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5649562Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917f8bee0>} 2025-05-07T20:33:03.5650897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5651906Z context = 2025-05-07T20:33:03.5652192Z 2025-05-07T20:33:03.5652359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5652880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5653341Z module_map=module_map) 2025-05-07T20:33:03.5653696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5654043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5654303Z E ^ 2025-05-07T20:33:03.5654764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5655216Z 2025-05-07T20:33:03.5655628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5656185Z
The next ten Hypothesis examples failed identically -- same test body, same traceback into fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (via torch/_dynamo/eval_frame.py when compiled=True) and down through triton/runtime/jit.py into triton/compiler/compiler.py, and the same CompilationError at line 1:0 of _fbgemm_silu_mul_quant. Only the sampled parameters differ, so the duplicate tracebacks are collapsed to one line per example:
2025-05-07T20:33:03.5656286Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:03.7639379Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:03.7671257Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.0467313Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:04.0498998Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:33:04.0530168Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.2464648Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:04.2495455Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.5397182Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:04.7487783Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
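Why every example fails: the ValueError itself pins down the root cause. The kernel asks Triton for the fp8e4nv (FP8 E4M3) element type, but Triton only lowers fp8e4nv on GPUs of compute capability 8.9 and newer (Ada/Hopper); a GPU that reports only fp8e4b15 and fp8e5, as here, is an SM 8.6-class part such as the NVIDIA A10G. A minimal probe for this condition, as a sketch -- the helper name is ours, not FBGEMM's:

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) needs an SM 8.9+ GPU (e.g. L4, H100). SM 8.6 parts
    # such as the A10G expose only fp8e4b15/fp8e5, matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)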
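Note where the error escapes: src.make_ir -> ast_to_ttir, i.e. the kernel dies while Triton is still building its IR, before a single tensor element is touched. That is why T, D, scale_ub, contiguous, and compiled have no influence on the outcome. A reduced reproduction, independent of FBGEMM and purely illustrative (kernel name, shapes, and the round-trip cast are our choices, not the real _fbgemm_silu_mul_quant):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_roundtrip(x_ptr, y_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    # The cast below is what trips compilation on SM < 8.9: fp8e4nv has no
    # lowering there, so ast_to_ttir raises the same ValueError as above.
    y = x.to(tl.float8e4nv).to(tl.float32)
    tl.store(y_ptr + offs, y)

x = torch.randn(16, device="cuda")
y = torch.empty_like(x)
_fp8_roundtrip[(1,)](x, y, N=16)  # raises CompilationError on an SM 8.6 GPU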
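Because the limitation is architectural, no Hypothesis example can pass and shrinking is pointless; the test needs to be gated on hardware, not on inputs. A hedged sketch of one way to do that with stock unittest (class name, decorator placement, and the capability threshold are our assumptions, not the test file's actual code):

import unittest
import torch

def _sm89_or_newer() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # name assumed for illustration
    @unittest.skipUnless(_sm89_or_newer(), "fp8e4nv (E4M3) requires SM 8.9+")
    def test_silu_mul_quant(self) -> None:
        # The existing @given/@settings-decorated body would sit here; the
        # skip fires in TestCase.run, before Hypothesis draws any example.
        pass

if __name__ == "__main__":
    unittest.main()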
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.7525188Z 2025-05-07T20:33:04.7525602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.7526172Z 2025-05-07T20:33:04.7526275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7526685Z self=, 2025-05-07T20:33:04.7527090Z T=16384, 2025-05-07T20:33:04.7527272Z D=5120, 2025-05-07T20:33:04.7527461Z scale_ub=1200.0, 2025-05-07T20:33:04.7527717Z contiguous=True, 2025-05-07T20:33:04.7527928Z compiled=True, 2025-05-07T20:33:04.7528128Z ) 2025-05-07T20:33:04.7528442Z self = 2025-05-07T20:33:04.7528927Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.7529201Z 2025-05-07T20:33:04.7529272Z @given( 2025-05-07T20:33:04.7529494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7529790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7530092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7530417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7530744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7531022Z ) 2025-05-07T20:33:04.7531412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7531845Z def test_silu_mul_quant( 2025-05-07T20:33:04.7532077Z self, 2025-05-07T20:33:04.7532309Z T: int, 2025-05-07T20:33:04.7532502Z D: int, 2025-05-07T20:33:04.7532707Z scale_ub: Optional[float], 2025-05-07T20:33:04.7532968Z contiguous: bool, 2025-05-07T20:33:04.7533204Z compiled: bool, 2025-05-07T20:33:04.7533414Z ) -> None: 2025-05-07T20:33:04.7533625Z torch.manual_seed(2025) 2025-05-07T20:33:04.7533860Z 2025-05-07T20:33:04.7534121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7534463Z 2025-05-07T20:33:04.7534646Z x_sign = torch.sign(x) 2025-05-07T20:33:04.7534927Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.7535235Z x = x_sign * x_clamp 2025-05-07T20:33:04.7535470Z x0 = x[:, :D] 2025-05-07T20:33:04.7535680Z x1 = x[:, D:] 2025-05-07T20:33:04.7535876Z 2025-05-07T20:33:04.7536053Z if contiguous: 2025-05-07T20:33:04.7536276Z x0 = x0.contiguous() 2025-05-07T20:33:04.7536527Z x1 = x1.contiguous() 2025-05-07T20:33:04.7536759Z 2025-05-07T20:33:04.7536943Z if scale_ub is not None: 2025-05-07T20:33:04.7537203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.7537531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.7537831Z ) 2025-05-07T20:33:04.7538012Z else: 2025-05-07T20:33:04.7538222Z scale_ub_tensor = None 2025-05-07T20:33:04.7538469Z 2025-05-07T20:33:04.7538684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.7538990Z op = silu_mul_quant 2025-05-07T20:33:04.7539232Z if compiled: 2025-05-07T20:33:04.7539469Z op = torch.compile(op) 2025-05-07T20:33:04.7539756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.7540025Z 2025-05-07T20:33:04.7540206Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.7540367Z 2025-05-07T20:33:04.7540464Z moe/activation_test.py:117: 2025-05-07T20:33:04.7540755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.7541081Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.7541349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.7541901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.7542500Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.7543148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.7543826Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.7544390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.7545064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.7545713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.7546273Z kernel = self.compile( 2025-05-07T20:33:04.7546802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.7547448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.7547830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.7548062Z 2025-05-07T20:33:04.7548265Z self = 2025-05-07T20:33:04.7549352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.7550802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917551550>} 2025-05-07T20:33:04.7552188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.7553210Z context = 2025-05-07T20:33:04.7553495Z 2025-05-07T20:33:04.7553662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.7554178Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.7554636Z module_map=module_map) 2025-05-07T20:33:04.7555007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.7555351Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.7555598Z E ^ 2025-05-07T20:33:04.7556060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.7556521Z 2025-05-07T20:33:04.7556936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.7557444Z 2025-05-07T20:33:04.9775810Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9776443Z self=, 2025-05-07T20:33:04.9777003Z T=16384, 2025-05-07T20:33:04.9777271Z D=5120, 2025-05-07T20:33:04.9777487Z scale_ub=None, 2025-05-07T20:33:04.9777702Z contiguous=False, 2025-05-07T20:33:04.9777931Z compiled=True, 2025-05-07T20:33:04.9778135Z ) 2025-05-07T20:33:04.9778452Z self = 2025-05-07T20:33:04.9778950Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.9779225Z 2025-05-07T20:33:04.9779307Z @given( 2025-05-07T20:33:04.9779532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9779852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9780163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9780492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9780822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9781141Z ) 2025-05-07T20:33:04.9781514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9782067Z def test_silu_mul_quant( 2025-05-07T20:33:04.9782313Z self, 2025-05-07T20:33:04.9782506Z T: int, 2025-05-07T20:33:04.9782699Z D: int, 2025-05-07T20:33:04.9782980Z scale_ub: Optional[float], 2025-05-07T20:33:04.9783255Z contiguous: bool, 2025-05-07T20:33:04.9783494Z compiled: bool, 2025-05-07T20:33:04.9783721Z ) -> None: 2025-05-07T20:33:04.9783936Z torch.manual_seed(2025) 2025-05-07T20:33:04.9784175Z 2025-05-07T20:33:04.9784449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9784859Z 2025-05-07T20:33:04.9785045Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9785343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9785658Z x = x_sign * x_clamp 2025-05-07T20:33:04.9785896Z x0 = x[:, :D] 2025-05-07T20:33:04.9786114Z x1 = x[:, D:] 2025-05-07T20:33:04.9786328Z 2025-05-07T20:33:04.9786509Z if contiguous: 2025-05-07T20:33:04.9786743Z x0 = x0.contiguous() 2025-05-07T20:33:04.9787002Z x1 = x1.contiguous() 2025-05-07T20:33:04.9787243Z 2025-05-07T20:33:04.9787431Z if scale_ub is not None: 2025-05-07T20:33:04.9787709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9788046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9788351Z ) 2025-05-07T20:33:04.9788549Z else: 2025-05-07T20:33:04.9788770Z scale_ub_tensor = None 2025-05-07T20:33:04.9789029Z 2025-05-07T20:33:04.9789337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9789653Z op = silu_mul_quant 2025-05-07T20:33:04.9789980Z if compiled: 2025-05-07T20:33:04.9790232Z op = torch.compile(op) 2025-05-07T20:33:04.9790532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9790801Z 2025-05-07T20:33:04.9790993Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9791203Z 2025-05-07T20:33:04.9791329Z moe/activation_test.py:117: 2025-05-07T20:33:04.9791633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9791962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9792253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9792819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.9793371Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.9794044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9794737Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9795275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9795954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9796621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9797150Z kernel = self.compile( 2025-05-07T20:33:04.9797691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9798349Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9798745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9798970Z 2025-05-07T20:33:04.9799187Z self = 2025-05-07T20:33:04.9800269Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9801654Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891764a1f0>} 2025-05-07T20:33:04.9803124Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9804649Z context = 2025-05-07T20:33:04.9804937Z 2025-05-07T20:33:04.9805109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9805709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9806180Z module_map=module_map) 2025-05-07T20:33:04.9806547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9806895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9807150Z E ^ 2025-05-07T20:33:04.9807616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9808071Z 2025-05-07T20:33:04.9808498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.9809014Z 2025-05-07T20:33:04.9809115Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9809529Z self=, 2025-05-07T20:33:04.9809934Z T=2048, 2025-05-07T20:33:04.9810114Z D=5120, 2025-05-07T20:33:04.9810370Z scale_ub=None, 2025-05-07T20:33:04.9810587Z contiguous=False, 2025-05-07T20:33:04.9810807Z compiled=True, 2025-05-07T20:33:04.9811010Z ) 2025-05-07T20:33:05.1020549Z self = 2025-05-07T20:33:05.1021603Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.1022383Z 2025-05-07T20:33:05.1022595Z @given( 2025-05-07T20:33:05.1023185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1024033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1024673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1025323Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1025973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1026530Z ) 2025-05-07T20:33:05.1027209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1028085Z def test_silu_mul_quant( 2025-05-07T20:33:05.1028559Z self, 2025-05-07T20:33:05.1028938Z T: int, 2025-05-07T20:33:05.1029314Z D: int, 2025-05-07T20:33:05.1029742Z scale_ub: Optional[float], 2025-05-07T20:33:05.1030373Z contiguous: bool, 2025-05-07T20:33:05.1030831Z compiled: bool, 2025-05-07T20:33:05.1031124Z ) -> None: 2025-05-07T20:33:05.1031341Z torch.manual_seed(2025) 2025-05-07T20:33:05.1031572Z 2025-05-07T20:33:05.1031843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1032182Z 2025-05-07T20:33:05.1032364Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1032663Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1032968Z x = x_sign * x_clamp 2025-05-07T20:33:05.1033201Z x0 = x[:, :D] 2025-05-07T20:33:05.1033417Z x1 = x[:, D:] 2025-05-07T20:33:05.1033620Z 2025-05-07T20:33:05.1033797Z if contiguous: 2025-05-07T20:33:05.1034028Z x0 = x0.contiguous() 2025-05-07T20:33:05.1034285Z x1 = x1.contiguous() 2025-05-07T20:33:05.1034520Z 2025-05-07T20:33:05.1034703Z if scale_ub is not None: 2025-05-07T20:33:05.1034976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.1035309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.1035614Z ) 2025-05-07T20:33:05.1035919Z else: 2025-05-07T20:33:05.1036127Z scale_ub_tensor = None 2025-05-07T20:33:05.1036369Z 2025-05-07T20:33:05.1036602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.1036913Z op = silu_mul_quant 2025-05-07T20:33:05.1037222Z if compiled: 2025-05-07T20:33:05.1037473Z op = torch.compile(op) 2025-05-07T20:33:05.1037766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1038033Z 2025-05-07T20:33:05.1038224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.1038386Z 2025-05-07T20:33:05.1038549Z moe/activation_test.py:117: 2025-05-07T20:33:05.1038840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1039166Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.1039451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1040015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.1040573Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.1041233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.1041924Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.1042456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.1043138Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.1043858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.1044392Z kernel = self.compile( 2025-05-07T20:33:05.1044925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.1045583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.1045982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1046214Z 2025-05-07T20:33:05.1046420Z self = 2025-05-07T20:33:05.1047504Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.1048897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891764af70>} 2025-05-07T20:33:05.1050250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.1051273Z context = 2025-05-07T20:33:05.1051560Z 2025-05-07T20:33:05.1051729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.1052246Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.1052715Z module_map=module_map) 2025-05-07T20:33:05.1053087Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.1053432Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.1053688Z E ^ 2025-05-07T20:33:05.1054157Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.1054618Z 2025-05-07T20:33:05.1055042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1055560Z 2025-05-07T20:33:05.1055665Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.1056081Z self=, 2025-05-07T20:33:05.1056548Z T=2048, 2025-05-07T20:33:05.1056733Z D=5120, 2025-05-07T20:33:05.1056926Z scale_ub=1200.0, 2025-05-07T20:33:05.1057153Z contiguous=False, 2025-05-07T20:33:05.1057368Z compiled=True, 2025-05-07T20:33:05.1057610Z ) 2025-05-07T20:33:05.1057948Z self = 2025-05-07T20:33:05.1058448Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.1058721Z 2025-05-07T20:33:05.1058806Z @given( 2025-05-07T20:33:05.1059030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1059380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1059682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1060006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1060332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1060617Z ) 2025-05-07T20:33:05.1060964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1061404Z def test_silu_mul_quant( 2025-05-07T20:33:05.1061646Z self, 2025-05-07T20:33:05.1061835Z T: int, 2025-05-07T20:33:05.1062026Z D: int, 2025-05-07T20:33:05.1062247Z scale_ub: Optional[float], 2025-05-07T20:33:05.1062514Z contiguous: bool, 2025-05-07T20:33:05.1062747Z compiled: bool, 2025-05-07T20:33:05.1062970Z ) -> None: 2025-05-07T20:33:05.1063182Z torch.manual_seed(2025) 2025-05-07T20:33:05.1063415Z 2025-05-07T20:33:05.1063745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1064086Z 2025-05-07T20:33:05.1064274Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1064568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1064873Z x = x_sign * x_clamp 2025-05-07T20:33:05.1065114Z x0 = x[:, :D] 2025-05-07T20:33:05.1065337Z x1 = x[:, D:] 2025-05-07T20:33:05.1065539Z 2025-05-07T20:33:05.1065719Z if contiguous: 2025-05-07T20:33:05.1065945Z x0 = x0.contiguous() 2025-05-07T20:33:05.1066203Z x1 = x1.contiguous() 2025-05-07T20:33:05.1066441Z 2025-05-07T20:33:05.1066644Z if scale_ub is not None: 2025-05-07T20:33:05.1066916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.1067248Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.1067554Z ) 2025-05-07T20:33:05.1067746Z else: 2025-05-07T20:33:05.1067955Z scale_ub_tensor = None 2025-05-07T20:33:05.1068200Z 2025-05-07T20:33:05.1068426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.1068743Z op = silu_mul_quant 2025-05-07T20:33:05.1068986Z if compiled: 2025-05-07T20:33:05.1069227Z op = torch.compile(op) 2025-05-07T20:33:05.1069520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1069790Z 2025-05-07T20:33:05.1070022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.1070186Z 2025-05-07T20:33:05.1070287Z moe/activation_test.py:117: 2025-05-07T20:33:05.1070577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1070909Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.1071212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.1071789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.1072339Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.1073001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.1073690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.1074220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.1074957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.1075616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.1076146Z kernel = self.compile( 2025-05-07T20:33:05.1076713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.1077367Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.1077759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.1078029Z 2025-05-07T20:33:05.1078237Z self = 2025-05-07T20:33:05.1079327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.1080712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891746e940>} 2025-05-07T20:33:05.1082070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.1083100Z context = 2025-05-07T20:33:05.1083384Z 2025-05-07T20:33:05.1083615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.1084145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.1084613Z module_map=module_map) 2025-05-07T20:33:05.1084990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.1085343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.1085600Z E ^ 2025-05-07T20:33:05.1086065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.1086519Z 2025-05-07T20:33:05.1086953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1087480Z 2025-05-07T20:33:05.5054716Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5055311Z self=, 2025-05-07T20:33:05.5055912Z T=4096, 2025-05-07T20:33:05.5056172Z D=5120, 2025-05-07T20:33:05.5056417Z scale_ub=1200.0, 2025-05-07T20:33:05.5056699Z contiguous=True, 2025-05-07T20:33:05.5056916Z compiled=True, 2025-05-07T20:33:05.5057103Z ) 2025-05-07T20:33:05.5057447Z self = 2025-05-07T20:33:05.5057949Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.5058222Z 2025-05-07T20:33:05.5058302Z @given( 2025-05-07T20:33:05.5058525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5058838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5059146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5059476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5059798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5060082Z ) 2025-05-07T20:33:05.5060434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5060873Z def test_silu_mul_quant( 2025-05-07T20:33:05.5061135Z self, 2025-05-07T20:33:05.5061351Z T: int, 2025-05-07T20:33:05.5061541Z D: int, 2025-05-07T20:33:05.5061754Z scale_ub: Optional[float], 2025-05-07T20:33:05.5062025Z contiguous: bool, 2025-05-07T20:33:05.5062256Z compiled: bool, 2025-05-07T20:33:05.5062599Z ) -> None: 2025-05-07T20:33:05.5062812Z torch.manual_seed(2025) 2025-05-07T20:33:05.5063047Z 2025-05-07T20:33:05.5063318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5063662Z 2025-05-07T20:33:05.5063911Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5064202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5064509Z x = x_sign * x_clamp 2025-05-07T20:33:05.5064742Z x0 = x[:, :D] 2025-05-07T20:33:05.5064948Z x1 = x[:, D:] 2025-05-07T20:33:05.5065152Z 2025-05-07T20:33:05.5065409Z if contiguous: 2025-05-07T20:33:05.5065634Z x0 = x0.contiguous() 2025-05-07T20:33:05.5065899Z x1 = x1.contiguous() 2025-05-07T20:33:05.5066143Z 2025-05-07T20:33:05.5066328Z if scale_ub is not None: 2025-05-07T20:33:05.5066599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5066933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5067240Z ) 2025-05-07T20:33:05.5067429Z else: 2025-05-07T20:33:05.5067632Z scale_ub_tensor = None 2025-05-07T20:33:05.5067873Z 2025-05-07T20:33:05.5068097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5068411Z op = silu_mul_quant 2025-05-07T20:33:05.5068650Z if compiled: 2025-05-07T20:33:05.5068894Z op = torch.compile(op) 2025-05-07T20:33:05.5069187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5069457Z 2025-05-07T20:33:05.5069642Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5069969Z 2025-05-07T20:33:05.5070069Z moe/activation_test.py:117: 2025-05-07T20:33:05.5070361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5070687Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5070964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5071560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.5072117Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.5072780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.5073460Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5073993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5074670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5075337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5075855Z kernel = self.compile( 2025-05-07T20:33:05.5076393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5077045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5077438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5077668Z 2025-05-07T20:33:05.5077871Z self = 2025-05-07T20:33:05.5078959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5080343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917409790>} 2025-05-07T20:33:05.5081685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5082749Z context = 2025-05-07T20:33:05.5083035Z 2025-05-07T20:33:05.5083198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5083754Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5084222Z module_map=module_map) 2025-05-07T20:33:05.5084582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5084937Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5085195Z E ^ 2025-05-07T20:33:05.5085703Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.5086160Z 2025-05-07T20:33:05.5086576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.5087087Z 2025-05-07T20:33:05.5087188Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5087596Z self=, 2025-05-07T20:33:05.5087992Z T=128, 2025-05-07T20:33:05.5088173Z D=5120, 2025-05-07T20:33:05.5088362Z scale_ub=1200.0, 2025-05-07T20:33:05.5088579Z contiguous=False, 2025-05-07T20:33:05.5088808Z compiled=True, 2025-05-07T20:33:05.5089010Z ) 2025-05-07T20:33:05.6408038Z self = 2025-05-07T20:33:05.6408838Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.6409212Z 2025-05-07T20:33:05.6409437Z @given( 2025-05-07T20:33:05.6409758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.6410068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.6410375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.6410697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.6411027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.6411353Z ) 2025-05-07T20:33:05.6411691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.6412131Z def test_silu_mul_quant( 2025-05-07T20:33:05.6412370Z self, 2025-05-07T20:33:05.6412560Z T: int, 2025-05-07T20:33:05.6412753Z D: int, 2025-05-07T20:33:05.6412967Z scale_ub: Optional[float], 2025-05-07T20:33:05.6413235Z contiguous: bool, 2025-05-07T20:33:05.6413468Z compiled: bool, 2025-05-07T20:33:05.6413690Z ) -> None: 2025-05-07T20:33:05.6413898Z torch.manual_seed(2025) 2025-05-07T20:33:05.6414146Z 2025-05-07T20:33:05.6414418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.6414754Z 2025-05-07T20:33:05.6414942Z x_sign = torch.sign(x) 2025-05-07T20:33:05.6415235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.6415542Z x = x_sign * x_clamp 2025-05-07T20:33:05.6415778Z x0 = x[:, :D] 2025-05-07T20:33:05.6415988Z x1 = x[:, D:] 2025-05-07T20:33:05.6416195Z 2025-05-07T20:33:05.6416373Z if contiguous: 2025-05-07T20:33:05.6416602Z x0 = x0.contiguous() 2025-05-07T20:33:05.6416856Z x1 = x1.contiguous() 2025-05-07T20:33:05.6417093Z 2025-05-07T20:33:05.6417283Z if scale_ub is not None: 2025-05-07T20:33:05.6417564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.6417894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.6418208Z ) 2025-05-07T20:33:05.6418398Z else: 2025-05-07T20:33:05.6418604Z scale_ub_tensor = None 2025-05-07T20:33:05.6418850Z 2025-05-07T20:33:05.6419079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.6419396Z op = silu_mul_quant 2025-05-07T20:33:05.6419639Z if compiled: 2025-05-07T20:33:05.6419882Z op = torch.compile(op) 2025-05-07T20:33:05.6420252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6420517Z 2025-05-07T20:33:05.6420707Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.6420874Z 2025-05-07T20:33:05.6420979Z moe/activation_test.py:117: 2025-05-07T20:33:05.6421352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6421707Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.6421982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6422533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.6423146Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.6423804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.6424492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.6425018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.6425701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.6426360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.6426891Z kernel = self.compile( 2025-05-07T20:33:05.6427428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.6428078Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.6428518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6428748Z 2025-05-07T20:33:05.6428954Z self = 2025-05-07T20:33:05.6430114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.6431505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172fe0d0>} 2025-05-07T20:33:05.6432852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.6433869Z context = 2025-05-07T20:33:05.6434159Z 2025-05-07T20:33:05.6434327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.6434844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.6435315Z module_map=module_map) 2025-05-07T20:33:05.6435681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.6436028Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.6436286Z E ^ 2025-05-07T20:33:05.6436757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.6437206Z 2025-05-07T20:33:05.6437632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.6438155Z 2025-05-07T20:33:05.6438258Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.6438670Z self=, 2025-05-07T20:33:05.6439076Z T=16384, 2025-05-07T20:33:05.6439262Z D=7168, 2025-05-07T20:33:05.6439454Z scale_ub=1200.0, 2025-05-07T20:33:05.6439671Z contiguous=True, 2025-05-07T20:33:05.6439891Z compiled=True, 2025-05-07T20:33:05.6440094Z ) 2025-05-07T20:33:05.6440409Z self = 2025-05-07T20:33:05.6440948Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.6441227Z 2025-05-07T20:33:05.6441302Z @given( 2025-05-07T20:33:05.6441532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.6441877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.6442181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.6442513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.6442844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.6443130Z ) 2025-05-07T20:33:05.6443537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.6443976Z def test_silu_mul_quant( 2025-05-07T20:33:05.6444224Z self, 2025-05-07T20:33:05.6444410Z T: int, 2025-05-07T20:33:05.6444603Z D: int, 2025-05-07T20:33:05.6444817Z scale_ub: Optional[float], 2025-05-07T20:33:05.6445082Z contiguous: bool, 2025-05-07T20:33:05.6445327Z compiled: bool, 2025-05-07T20:33:05.6445542Z ) -> None: 2025-05-07T20:33:05.6445751Z torch.manual_seed(2025) 2025-05-07T20:33:05.6445991Z 2025-05-07T20:33:05.6446259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.6446596Z 2025-05-07T20:33:05.6446786Z x_sign = torch.sign(x) 2025-05-07T20:33:05.6447067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.6447372Z x = x_sign * x_clamp 2025-05-07T20:33:05.6447603Z x0 = x[:, :D] 2025-05-07T20:33:05.6447813Z x1 = x[:, D:] 2025-05-07T20:33:05.6448063Z 2025-05-07T20:33:05.6448233Z if contiguous: 2025-05-07T20:33:05.6448458Z x0 = x0.contiguous() 2025-05-07T20:33:05.6448707Z x1 = x1.contiguous() 2025-05-07T20:33:05.6448935Z 2025-05-07T20:33:05.6449112Z if scale_ub is not None: 2025-05-07T20:33:05.6449385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.6449709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.6450009Z ) 2025-05-07T20:33:05.6450193Z else: 2025-05-07T20:33:05.6450388Z scale_ub_tensor = None 2025-05-07T20:33:05.6450629Z 2025-05-07T20:33:05.6450853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.6451155Z op = silu_mul_quant 2025-05-07T20:33:05.6451398Z if compiled: 2025-05-07T20:33:05.6451635Z op = torch.compile(op) 2025-05-07T20:33:05.6451920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6452187Z 2025-05-07T20:33:05.6452377Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.6452538Z 2025-05-07T20:33:05.6452636Z moe/activation_test.py:117: 2025-05-07T20:33:05.6452919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6453239Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.6453511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.6454059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.6454611Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.6455268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.6455949Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.6456473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.6457152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.6457810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.6458329Z kernel = self.compile( 2025-05-07T20:33:05.6458862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.6459552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.6459939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.6460160Z 2025-05-07T20:33:05.6460400Z self = 2025-05-07T20:33:05.6461536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.6462943Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89172fed30>} 2025-05-07T20:33:05.6464281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.6465296Z context = 2025-05-07T20:33:05.6465577Z 2025-05-07T20:33:05.6465740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.6466261Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.6466721Z module_map=module_map) 2025-05-07T20:33:05.6467084Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.6467434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.6467730Z E ^ 2025-05-07T20:33:05.6468193Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.6468650Z 2025-05-07T20:33:05.6469068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.6469582Z 2025-05-07T20:33:05.9231686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9232296Z self=, 2025-05-07T20:33:05.9232843Z T=16384, 2025-05-07T20:33:05.9233091Z D=5120, 2025-05-07T20:33:05.9233341Z scale_ub=1200.0, 2025-05-07T20:33:05.9233649Z contiguous=True, 2025-05-07T20:33:05.9233873Z compiled=False, 2025-05-07T20:33:05.9234079Z ) 2025-05-07T20:33:05.9234396Z self = 2025-05-07T20:33:05.9234887Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.9235175Z 2025-05-07T20:33:05.9235251Z @given( 2025-05-07T20:33:05.9235476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9235788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9236127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9236447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9236809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9237213Z ) 2025-05-07T20:33:05.9237691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9238248Z def test_silu_mul_quant( 2025-05-07T20:33:05.9238492Z self, 2025-05-07T20:33:05.9238686Z T: int, 2025-05-07T20:33:05.9238873Z D: int, 2025-05-07T20:33:05.9239086Z scale_ub: Optional[float], 2025-05-07T20:33:05.9239355Z contiguous: bool, 2025-05-07T20:33:05.9239585Z compiled: bool, 2025-05-07T20:33:05.9239802Z ) -> None: 2025-05-07T20:33:05.9240018Z torch.manual_seed(2025) 2025-05-07T20:33:05.9240253Z 2025-05-07T20:33:05.9240518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9240852Z 2025-05-07T20:33:05.9241035Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9241323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9241802Z x = x_sign * x_clamp 2025-05-07T20:33:05.9242038Z x0 = x[:, :D] 2025-05-07T20:33:05.9242282Z x1 = x[:, D:] 2025-05-07T20:33:05.9242489Z 2025-05-07T20:33:05.9242675Z if contiguous: 2025-05-07T20:33:05.9242906Z x0 = x0.contiguous() 2025-05-07T20:33:05.9243224Z x1 = x1.contiguous() 2025-05-07T20:33:05.9243468Z 2025-05-07T20:33:05.9243653Z if scale_ub is not None: 2025-05-07T20:33:05.9243916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9244245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9244617Z ) 2025-05-07T20:33:05.9244808Z else: 2025-05-07T20:33:05.9245006Z scale_ub_tensor = None 2025-05-07T20:33:05.9245253Z 2025-05-07T20:33:05.9245482Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9245787Z op = silu_mul_quant 2025-05-07T20:33:05.9246031Z if compiled: 2025-05-07T20:33:05.9246276Z op = torch.compile(op) 2025-05-07T20:33:05.9246564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9246831Z 2025-05-07T20:33:05.9247019Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9247183Z 2025-05-07T20:33:05.9247280Z moe/activation_test.py:117: 2025-05-07T20:33:05.9247573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9247906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9248186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9248941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9249634Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9250173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9250850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9251511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9252038Z kernel = self.compile( 2025-05-07T20:33:05.9252577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9253217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9253608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9253835Z 2025-05-07T20:33:05.9254046Z self = 2025-05-07T20:33:05.9255141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9256516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891725c700>} 2025-05-07T20:33:05.9257871Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9258891Z context = 2025-05-07T20:33:05.9259175Z 2025-05-07T20:33:05.9259351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9259872Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9260337Z module_map=module_map) 2025-05-07T20:33:05.9260696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9261046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9261296Z E ^ 2025-05-07T20:33:05.9261868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.9262317Z 2025-05-07T20:33:05.9262779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.9263291Z 2025-05-07T20:33:05.9263395Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9263806Z self=, 2025-05-07T20:33:05.9264210Z T=1, 2025-05-07T20:33:05.9264385Z D=7168, 2025-05-07T20:33:05.9264568Z scale_ub=1200.0, 2025-05-07T20:33:05.9264852Z contiguous=False, 2025-05-07T20:33:05.9265076Z compiled=False, 2025-05-07T20:33:05.9265270Z ) 2025-05-07T20:33:05.9265584Z self = 2025-05-07T20:33:05.9266067Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.9266331Z 2025-05-07T20:33:05.9266410Z @given( 2025-05-07T20:33:05.9266635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.9266941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.9267242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.9267566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.9267890Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.9268169Z ) 2025-05-07T20:33:05.9268509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.9268944Z def test_silu_mul_quant( 2025-05-07T20:33:05.9269227Z self, 2025-05-07T20:33:05.9269415Z T: int, 2025-05-07T20:33:05.9269606Z D: int, 2025-05-07T20:33:05.9269904Z scale_ub: Optional[float], 2025-05-07T20:33:05.9270161Z contiguous: bool, 2025-05-07T20:33:05.9270394Z compiled: bool, 2025-05-07T20:33:05.9270608Z ) -> None: 2025-05-07T20:33:05.9270811Z torch.manual_seed(2025) 2025-05-07T20:33:05.9271049Z 2025-05-07T20:33:05.9271313Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.9271694Z 2025-05-07T20:33:05.9271879Z x_sign = torch.sign(x) 2025-05-07T20:33:05.9272168Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.9272469Z x = x_sign * x_clamp 2025-05-07T20:33:05.9278507Z x0 = x[:, :D] 2025-05-07T20:33:05.9278759Z x1 = x[:, D:] 2025-05-07T20:33:05.9278963Z 2025-05-07T20:33:05.9279142Z if contiguous: 2025-05-07T20:33:05.9279374Z x0 = x0.contiguous() 2025-05-07T20:33:05.9279642Z x1 = x1.contiguous() 2025-05-07T20:33:05.9279867Z 2025-05-07T20:33:05.9280056Z if scale_ub is not None: 2025-05-07T20:33:05.9280324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.9280656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.9280957Z ) 2025-05-07T20:33:05.9281145Z else: 2025-05-07T20:33:05.9281341Z scale_ub_tensor = None 2025-05-07T20:33:05.9281594Z 2025-05-07T20:33:05.9281824Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.9282131Z op = silu_mul_quant 2025-05-07T20:33:05.9282375Z if compiled: 2025-05-07T20:33:05.9282614Z op = torch.compile(op) 2025-05-07T20:33:05.9282898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9283164Z 2025-05-07T20:33:05.9283346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.9283505Z 2025-05-07T20:33:05.9283607Z moe/activation_test.py:117: 2025-05-07T20:33:05.9283898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9284228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.9284498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.9285183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.9285951Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.9286481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.9287200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.9287852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.9288380Z kernel = self.compile( 2025-05-07T20:33:05.9288915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.9289597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.9289988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.9290223Z 2025-05-07T20:33:05.9290426Z self = 2025-05-07T20:33:05.9291506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.9292884Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89173940d0>} 2025-05-07T20:33:05.9294266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.9295290Z context = 2025-05-07T20:33:05.9295572Z 2025-05-07T20:33:05.9295741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.9296260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.9296726Z module_map=module_map) 2025-05-07T20:33:05.9297089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.9297440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.9297686Z E ^ 2025-05-07T20:33:05.9298151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.9298597Z 2025-05-07T20:33:05.9299015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.9299527Z 2025-05-07T20:33:05.9299638Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.9300040Z self=, 2025-05-07T20:33:05.9300445Z T=4096, 2025-05-07T20:33:05.9300618Z D=7168, 2025-05-07T20:33:05.9300792Z scale_ub=1200.0, 2025-05-07T20:33:05.9301006Z contiguous=False, 2025-05-07T20:33:05.9301225Z compiled=True, 2025-05-07T20:33:05.9301417Z ) 2025-05-07T20:33:06.0470138Z self = 2025-05-07T20:33:06.0470833Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.0471244Z 2025-05-07T20:33:06.0471352Z @given( 2025-05-07T20:33:06.0471672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.0472120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.0472488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.0472829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.0473151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.0473433Z ) 2025-05-07T20:33:06.0473780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.0474215Z def test_silu_mul_quant( 2025-05-07T20:33:06.0474456Z self, 2025-05-07T20:33:06.0474768Z T: int, 2025-05-07T20:33:06.0474960Z D: int, 2025-05-07T20:33:06.0475178Z scale_ub: Optional[float], 2025-05-07T20:33:06.0475447Z contiguous: bool, 2025-05-07T20:33:06.0475683Z compiled: bool, 2025-05-07T20:33:06.0475906Z ) -> None: 2025-05-07T20:33:06.0476186Z torch.manual_seed(2025) 2025-05-07T20:33:06.0476428Z 2025-05-07T20:33:06.0476697Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.0477037Z 2025-05-07T20:33:06.0477226Z x_sign = torch.sign(x) 2025-05-07T20:33:06.0477516Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.0477889Z x = x_sign * x_clamp 2025-05-07T20:33:06.0478127Z x0 = x[:, :D] 2025-05-07T20:33:06.0478335Z x1 = x[:, D:] 2025-05-07T20:33:06.0478532Z 2025-05-07T20:33:06.0478714Z if contiguous: 2025-05-07T20:33:06.0478939Z x0 = x0.contiguous() 2025-05-07T20:33:06.0479188Z x1 = x1.contiguous() 2025-05-07T20:33:06.0479424Z 2025-05-07T20:33:06.0479601Z if scale_ub is not None: 2025-05-07T20:33:06.0479864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.0480195Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.0480499Z ) 2025-05-07T20:33:06.0480686Z else: 2025-05-07T20:33:06.0480891Z scale_ub_tensor = None 2025-05-07T20:33:06.0481131Z 2025-05-07T20:33:06.0481362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.0481674Z op = silu_mul_quant 2025-05-07T20:33:06.0482014Z if compiled: 2025-05-07T20:33:06.0482249Z op = torch.compile(op) 2025-05-07T20:33:06.0482542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0482809Z 2025-05-07T20:33:06.0482988Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.0483152Z 2025-05-07T20:33:06.0483248Z moe/activation_test.py:117: 2025-05-07T20:33:06.0483537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.0483866Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.0484135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.0484692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.0485247Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.0485903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.0486590Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.0487128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.0487805Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.0488463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.0488991Z     kernel = self.compile(
2025-05-07T20:33:06.0489526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.0490176Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.0490568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0490798Z
2025-05-07T20:33:06.0490997Z self = <...>
2025-05-07T20:33:06.0492133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.0493524Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f8917394dc0>}
2025-05-07T20:33:06.0494918Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.0495979Z context = <...>
2025-05-07T20:33:06.0496266Z
2025-05-07T20:33:06.0496430Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.0496947Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.0497409Z                            module_map=module_map)
2025-05-07T20:33:06.0497813Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.0498161Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.0498409Z E       ^
2025-05-07T20:33:06.0498870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.0499339Z
2025-05-07T20:33:06.0499756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.0500266Z
2025-05-07T20:33:06.0500373Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.0500784Z     self=<...>,
2025-05-07T20:33:06.0501186Z     T=128,
2025-05-07T20:33:06.0501372Z     D=7168,
2025-05-07T20:33:06.0501576Z     scale_ub=1200.0,
2025-05-07T20:33:06.0501826Z     contiguous=False,
2025-05-07T20:33:06.0502051Z     compiled=True,
2025-05-07T20:33:06.0502248Z )
2025-05-07T20:33:06.0502603Z self = <...>
2025-05-07T20:33:06.0503095Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:06.0503361Z
2025-05-07T20:33:06.0503442Z     @given(
2025-05-07T20:33:06.0503664Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.0504158Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.0504460Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.0504779Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.0505098Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.0505380Z     )
2025-05-07T20:33:06.0505715Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.0506153Z     def test_silu_mul_quant(
2025-05-07T20:33:06.0506389Z         self,
2025-05-07T20:33:06.0506576Z         T: int,
2025-05-07T20:33:06.0506756Z         D: int,
2025-05-07T20:33:06.0506973Z         scale_ub: Optional[float],
2025-05-07T20:33:06.0507233Z         contiguous: bool,
2025-05-07T20:33:06.0507467Z         compiled: bool,
2025-05-07T20:33:06.0507680Z     ) -> None:
2025-05-07T20:33:06.0507888Z         torch.manual_seed(2025)
2025-05-07T20:33:06.0508122Z
2025-05-07T20:33:06.0508392Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.0508741Z
2025-05-07T20:33:06.0508919Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.0509195Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.0509499Z         x = x_sign * x_clamp
2025-05-07T20:33:06.0509727Z         x0 = x[:, :D]
2025-05-07T20:33:06.0509983Z         x1 = x[:, D:]
2025-05-07T20:33:06.0510187Z
2025-05-07T20:33:06.0510359Z         if contiguous:
2025-05-07T20:33:06.0510582Z             x0 = x0.contiguous()
2025-05-07T20:33:06.0510832Z             x1 = x1.contiguous()
2025-05-07T20:33:06.0511061Z
2025-05-07T20:33:06.0511251Z         if scale_ub is not None:
2025-05-07T20:33:06.0511515Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:06.0511840Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:06.0512132Z             )
2025-05-07T20:33:06.0512314Z         else:
2025-05-07T20:33:06.0512517Z             scale_ub_tensor = None
2025-05-07T20:33:06.0512756Z
2025-05-07T20:33:06.0513054Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.0513367Z             op = silu_mul_quant
2025-05-07T20:33:06.0513603Z             if compiled:
2025-05-07T20:33:06.0513842Z                 op = torch.compile(op)
2025-05-07T20:33:06.0514199Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.0514467Z
2025-05-07T20:33:06.0514647Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:06.0514808Z
2025-05-07T20:33:06.0514906Z moe/activation_test.py:117:
2025-05-07T20:33:06.0515200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0515616Z moe/activation_test.py:115: in fn
2025-05-07T20:33:06.0515886Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.0516444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:06.0516993Z     return fn(*args, **kwargs)
2025-05-07T20:33:06.0517643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.0518327Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.0518859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.0519536Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.0520192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.0520715Z     kernel = self.compile(
2025-05-07T20:33:06.0521312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.0521962Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.0522351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.0522580Z
2025-05-07T20:33:06.0522781Z self = <...>
2025-05-07T20:33:06.0523866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.0525244Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f89171c1940>}
2025-05-07T20:33:06.0526588Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.0527611Z context = <...>
2025-05-07T20:33:06.0527893Z
2025-05-07T20:33:06.0528061Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.0528582Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.0529039Z                            module_map=module_map)
2025-05-07T20:33:06.0529399Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.0529752Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.0529994Z E       ^
2025-05-07T20:33:06.0530447Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
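Every CompilationError in this run is the same failure: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this runner's GPU. The job ran on linux.g5.4xlarge.nvidia.gpu, whose A10G is compute capability 8.6, and Triton's fp8e4nv codegen needs newer silicon, which is why the error offers only 'fp8e4b15' and 'fp8e5'. A guard along the following lines would skip these examples instead of failing them; this is a minimal sketch, and the helper name plus the (8, 9) threshold are assumptions inferred from the error above, not taken from the FBGEMM sources.

    import unittest

    import torch

    def _gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) Triton kernels compile only on newer GPUs; the
        # (8, 9) cutoff here is an assumption drawn from the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _gpu_supports_fp8e4nv(), "GPU cannot compile fp8e4nv Triton kernels")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name, for illustration
        ...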
2025-05-07T20:33:06.0530899Z
2025-05-07T20:33:06.0531326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.0531839Z
2025-05-07T20:33:06.2255615Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback to the one above
2025-05-07T20:33:06.2289887Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:06.2290316Z     self=<...>,
2025-05-07T20:33:06.2290724Z     T=16384,
2025-05-07T20:33:06.2290922Z     D=5120,
2025-05-07T20:33:06.2291129Z     scale_ub=None,
2025-05-07T20:33:06.2291354Z     contiguous=False,
2025-05-07T20:33:06.2291594Z     compiled=False,
2025-05-07T20:33:06.2291798Z )
2025-05-07T20:33:06.2292123Z self = <...>
2025-05-07T20:33:06.2292626Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:06.2292906Z
2025-05-07T20:33:06.2292986Z     @given(
2025-05-07T20:33:06.2293224Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.2293547Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.2293856Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.2294195Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.2294535Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.2294825Z     )
2025-05-07T20:33:06.2295183Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.2295633Z     def test_silu_mul_quant(
2025-05-07T20:33:06.2295883Z         self,
2025-05-07T20:33:06.2296078Z         T: int,
2025-05-07T20:33:06.2296287Z         D: int,
2025-05-07T20:33:06.2296515Z         scale_ub: Optional[float],
2025-05-07T20:33:06.2296789Z         contiguous: bool,
2025-05-07T20:33:06.2297038Z         compiled: bool,
2025-05-07T20:33:06.2297274Z     ) -> None:
2025-05-07T20:33:06.2297493Z         torch.manual_seed(2025)
2025-05-07T20:33:06.2297747Z
2025-05-07T20:33:06.2298030Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.2298450Z
2025-05-07T20:33:06.2298654Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.2298955Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.2301032Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:06.2302937Z
2025-05-07T20:33:06.2303072Z moe/activation_test.py:95: OutOfMemoryError
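The OutOfMemoryError cascade that follows is consistent with plain tensor arithmetic: each example materializes x with shape [T, 2 * D] in bfloat16 (2 bytes per element) while the process is already holding roughly 22 GiB of the A10G's 22.07 GiB, so even the first allocation of an example can fail. The "Tried to allocate" figures match the input sizes exactly; a quick sanity check of that arithmetic (a standalone sketch, not part of the test):

    # MiB needed for one [T, 2 * D] bfloat16 tensor (2 bytes per element).
    def input_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    print(input_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB" above
    print(input_mib(4096, 7168))   # 112.0
    print(input_mib(2048, 7168))   # 56.0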
2025-05-07T20:33:06.2303400Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:33:06.2317313Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:33:06.3395402Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.3409300Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.3422324Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback to the one above
2025-05-07T20:33:06.6744698Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.6778880Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.7708642Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 56.00 MiB)
2025-05-07T20:33:06.7721127Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture"), identical traceback
2025-05-07T20:33:06.8240033Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.8253082Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 320.00 MiB)
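All of these allocator messages end with the same hint about PYTORCH_CUDA_ALLOC_CONF. In this run the reserved-but-unallocated figures are small (tens of MiB), so fragmentation is probably not the main culprit, but if the hint is followed it has to take effect before the process performs its first CUDA allocation. A sketch of the safe ordering, assuming nothing else initializes CUDA first (in CI it is simpler to set the variable in the job environment):

    import os

    # Must be in place before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the allocator config is set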
2025-05-07T20:33:06.8265583Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 80.00 MiB)
2025-05-07T20:33:06.9315819Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.9328340Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
2025-05-07T20:33:06.9340555Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 40.00 MiB)
2025-05-07T20:33:06.9352958Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 112.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:06.9364900Z 2025-05-07T20:33:06.9365016Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:06.9365229Z 2025-05-07T20:33:06.9365342Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9365745Z self=, 2025-05-07T20:33:06.9366143Z T=16384, 2025-05-07T20:33:06.9366332Z D=7168, 2025-05-07T20:33:06.9366515Z scale_ub=None, 2025-05-07T20:33:06.9366728Z contiguous=False, 2025-05-07T20:33:06.9366993Z compiled=True, 2025-05-07T20:33:06.9367194Z ) 2025-05-07T20:33:07.0669582Z self = 2025-05-07T20:33:07.0670373Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.0670751Z 2025-05-07T20:33:07.0670862Z @given( 2025-05-07T20:33:07.0671186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0671497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0671801Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0672134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0672497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0672778Z ) 2025-05-07T20:33:07.0673130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0673570Z def test_silu_mul_quant( 2025-05-07T20:33:07.0673809Z self, 2025-05-07T20:33:07.0674019Z T: int, 2025-05-07T20:33:07.0674223Z D: int, 2025-05-07T20:33:07.0674442Z scale_ub: Optional[float], 2025-05-07T20:33:07.0674711Z contiguous: bool, 2025-05-07T20:33:07.0674955Z compiled: bool, 2025-05-07T20:33:07.0675186Z ) -> None: 2025-05-07T20:33:07.0675396Z torch.manual_seed(2025) 2025-05-07T20:33:07.0675640Z 2025-05-07T20:33:07.0675916Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0677992Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
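The request sizes the allocator reports match the test's input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so T=16384, D=7168 gives 16384 x 14336 x 2 bytes = 448 MiB, T=4096 gives 112 MiB, and T=2048, D=5120 gives the 40 MiB seen above. A quick check:

    # Allocation-size check for x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        return T * 2 * D * bytes_per_elem / 2**20

    assert alloc_mib(16384, 7168) == 448.0
    assert alloc_mib(4096, 7168) == 112.0
    assert alloc_mib(2048, 5120) == 40.0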
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0680185Z 2025-05-07T20:33:07.0680304Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0680522Z 2025-05-07T20:33:07.0680622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0681036Z self=, 2025-05-07T20:33:07.0681467Z T=4096, 2025-05-07T20:33:07.0681674Z D=7168, 2025-05-07T20:33:07.0681866Z scale_ub=None, 2025-05-07T20:33:07.0682084Z contiguous=True, 2025-05-07T20:33:07.0682301Z compiled=False, 2025-05-07T20:33:07.0682511Z ) 2025-05-07T20:33:07.0682907Z self = 2025-05-07T20:33:07.0683397Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.0683670Z 2025-05-07T20:33:07.0683747Z @given( 2025-05-07T20:33:07.0683980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0684363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0684668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0684997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0685321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0685599Z ) 2025-05-07T20:33:07.0685949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0686386Z def test_silu_mul_quant( 2025-05-07T20:33:07.0686622Z self, 2025-05-07T20:33:07.0686817Z T: int, 2025-05-07T20:33:07.0687018Z D: int, 2025-05-07T20:33:07.0687230Z scale_ub: Optional[float], 2025-05-07T20:33:07.0687508Z contiguous: bool, 2025-05-07T20:33:07.0687746Z compiled: bool, 2025-05-07T20:33:07.0687964Z ) -> None: 2025-05-07T20:33:07.0688177Z torch.manual_seed(2025) 2025-05-07T20:33:07.0688418Z 2025-05-07T20:33:07.0688682Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0690820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0692727Z 2025-05-07T20:33:07.0692847Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0693066Z 2025-05-07T20:33:07.0693168Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0693579Z self=, 2025-05-07T20:33:07.0693976Z T=16384, 2025-05-07T20:33:07.0694172Z D=7168, 2025-05-07T20:33:07.0694369Z scale_ub=None, 2025-05-07T20:33:07.0694578Z contiguous=True, 2025-05-07T20:33:07.0703212Z compiled=False, 2025-05-07T20:33:07.0703431Z ) 2025-05-07T20:33:07.0704018Z self = 2025-05-07T20:33:07.0704535Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.0704814Z 2025-05-07T20:33:07.0704891Z @given( 2025-05-07T20:33:07.0705124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0705444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0705748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0706217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0706553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0706848Z ) 2025-05-07T20:33:07.0707197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0707644Z def test_silu_mul_quant( 2025-05-07T20:33:07.0707901Z self, 2025-05-07T20:33:07.0708094Z T: int, 2025-05-07T20:33:07.0708293Z D: int, 2025-05-07T20:33:07.0708516Z scale_ub: Optional[float], 2025-05-07T20:33:07.0708785Z contiguous: bool, 2025-05-07T20:33:07.0709030Z compiled: bool, 2025-05-07T20:33:07.0709261Z ) -> None: 2025-05-07T20:33:07.0709478Z torch.manual_seed(2025) 2025-05-07T20:33:07.0709730Z 2025-05-07T20:33:07.0710072Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0712260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0714247Z 2025-05-07T20:33:07.0714374Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0714585Z 2025-05-07T20:33:07.0714684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0715102Z self=, 2025-05-07T20:33:07.0715510Z T=16384, 2025-05-07T20:33:07.0715694Z D=7168, 2025-05-07T20:33:07.0715883Z scale_ub=1200.0, 2025-05-07T20:33:07.0716105Z contiguous=True, 2025-05-07T20:33:07.0716318Z compiled=False, 2025-05-07T20:33:07.0716530Z ) 2025-05-07T20:33:07.0716847Z self = 2025-05-07T20:33:07.0717337Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.0717625Z 2025-05-07T20:33:07.0717697Z @given( 2025-05-07T20:33:07.0717990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.0718303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.0718604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.0718934Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.0719261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.0719538Z ) 2025-05-07T20:33:07.0719887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.0720327Z def test_silu_mul_quant( 2025-05-07T20:33:07.0720563Z self, 2025-05-07T20:33:07.0720753Z T: int, 2025-05-07T20:33:07.0720951Z D: int, 2025-05-07T20:33:07.0721162Z scale_ub: Optional[float], 2025-05-07T20:33:07.0721438Z contiguous: bool, 2025-05-07T20:33:07.0721676Z compiled: bool, 2025-05-07T20:33:07.0721900Z ) -> None: 2025-05-07T20:33:07.0722106Z torch.manual_seed(2025) 2025-05-07T20:33:07.0722350Z 2025-05-07T20:33:07.0722627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.0724713Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.0726656Z 2025-05-07T20:33:07.0726773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.0726991Z 2025-05-07T20:33:07.0727093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.0727507Z self=, 2025-05-07T20:33:07.0727910Z T=128, 2025-05-07T20:33:07.0728089Z D=5120, 2025-05-07T20:33:07.0728275Z scale_ub=1200.0, 2025-05-07T20:33:07.0728503Z contiguous=False, 2025-05-07T20:33:07.0728718Z compiled=False, 2025-05-07T20:33:07.0728924Z ) 2025-05-07T20:33:07.2350125Z self = 2025-05-07T20:33:07.2350893Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.2351269Z 2025-05-07T20:33:07.2351379Z @given( 2025-05-07T20:33:07.2351670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2351990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2352605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2352943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2353278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2353573Z ) 2025-05-07T20:33:07.2353927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2354460Z def test_silu_mul_quant( 2025-05-07T20:33:07.2354708Z self, 2025-05-07T20:33:07.2354903Z T: int, 2025-05-07T20:33:07.2355105Z D: int, 2025-05-07T20:33:07.2355328Z scale_ub: Optional[float], 2025-05-07T20:33:07.2355602Z contiguous: bool, 2025-05-07T20:33:07.2355838Z compiled: bool, 2025-05-07T20:33:07.2356069Z ) -> None: 2025-05-07T20:33:07.2356295Z torch.manual_seed(2025) 2025-05-07T20:33:07.2356533Z 2025-05-07T20:33:07.2356810Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2357158Z 2025-05-07T20:33:07.2357353Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2357650Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2357963Z x = x_sign * x_clamp 2025-05-07T20:33:07.2358197Z x0 = x[:, :D] 2025-05-07T20:33:07.2358415Z x1 = x[:, D:] 2025-05-07T20:33:07.2358626Z 2025-05-07T20:33:07.2358887Z if contiguous: 2025-05-07T20:33:07.2359128Z x0 = x0.contiguous() 2025-05-07T20:33:07.2359392Z x1 = x1.contiguous() 2025-05-07T20:33:07.2359630Z 2025-05-07T20:33:07.2359825Z if scale_ub is not None: 2025-05-07T20:33:07.2360101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.2360437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.2360750Z ) 2025-05-07T20:33:07.2360950Z else: 2025-05-07T20:33:07.2361163Z scale_ub_tensor = None 2025-05-07T20:33:07.2361426Z 2025-05-07T20:33:07.2361701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.2362027Z op = silu_mul_quant 2025-05-07T20:33:07.2362277Z if compiled: 2025-05-07T20:33:07.2362528Z op = torch.compile(op) 2025-05-07T20:33:07.2362833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2363107Z 2025-05-07T20:33:07.2363305Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.2363475Z 2025-05-07T20:33:07.2363588Z moe/activation_test.py:117: 2025-05-07T20:33:07.2363884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2364222Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.2364510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2365214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.2365908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.2366457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.2367241Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.2367903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.2368447Z kernel = self.compile( 2025-05-07T20:33:07.2369006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.2369669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.2370068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2370302Z 2025-05-07T20:33:07.2370509Z self = 2025-05-07T20:33:07.2371662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.2373133Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916e4cca0>} 2025-05-07T20:33:07.2374480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.2375550Z context = 2025-05-07T20:33:07.2375844Z 2025-05-07T20:33:07.2376012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.2376539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.2377004Z module_map=module_map) 2025-05-07T20:33:07.2377382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.2377740Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.2378005Z E ^ 2025-05-07T20:33:07.2378468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.2378929Z 2025-05-07T20:33:07.2379388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.2379908Z 2025-05-07T20:33:07.2380020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2380440Z self=, 2025-05-07T20:33:07.2380844Z T=2048, 2025-05-07T20:33:07.2381034Z D=7168, 2025-05-07T20:33:07.2381227Z scale_ub=None, 2025-05-07T20:33:07.2381440Z contiguous=False, 2025-05-07T20:33:07.2381669Z compiled=False, 2025-05-07T20:33:07.2381879Z ) 2025-05-07T20:33:07.2382195Z self = 2025-05-07T20:33:07.2382705Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.2382978Z 2025-05-07T20:33:07.2383061Z @given( 2025-05-07T20:33:07.2383289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2383605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2383922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2384255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2384586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2384872Z ) 2025-05-07T20:33:07.2385224Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2385658Z def test_silu_mul_quant( 2025-05-07T20:33:07.2385902Z self, 2025-05-07T20:33:07.2386103Z T: int, 2025-05-07T20:33:07.2386297Z D: int, 2025-05-07T20:33:07.2386519Z scale_ub: Optional[float], 2025-05-07T20:33:07.2386802Z contiguous: bool, 2025-05-07T20:33:07.2387097Z compiled: bool, 2025-05-07T20:33:07.2387323Z ) -> None: 2025-05-07T20:33:07.2387544Z torch.manual_seed(2025) 2025-05-07T20:33:07.2387783Z 2025-05-07T20:33:07.2388066Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2390212Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
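The CompilationError earlier in this trace is a distinct failure mode from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU, offering only fp8e4b15 and fp8e5. The 22.07 GiB capacity reported in these errors is consistent with an NVIDIA A10G (compute capability 8.6), below the newer architectures on which Triton exposes fp8e4nv. A hedged sketch of a capability guard such a test could use to skip rather than error; the SM 8.9 threshold is an assumption, not taken from this log:

    # Sketch: skip FP8 E4M3 tests on GPUs Triton cannot compile them for.
    # Assumes SM 8.9 (Ada) as the minimum; adjust if the kernel's needs differ.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8GuardedTests(unittest.TestCase):
        ...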
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2392088Z 2025-05-07T20:33:07.2392211Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.2392426Z 2025-05-07T20:33:07.2392584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2392997Z self=, 2025-05-07T20:33:07.2393404Z T=128, 2025-05-07T20:33:07.2393593Z D=7168, 2025-05-07T20:33:07.2393780Z scale_ub=1200.0, 2025-05-07T20:33:07.2394054Z contiguous=True, 2025-05-07T20:33:07.2394279Z compiled=True, 2025-05-07T20:33:07.2394485Z ) 2025-05-07T20:33:07.2843577Z self = 2025-05-07T20:33:07.2844169Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2844448Z 2025-05-07T20:33:07.2844527Z @given( 2025-05-07T20:33:07.2844761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2845073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2845372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2845725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2846066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2846349Z ) 2025-05-07T20:33:07.2846698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2847157Z def test_silu_mul_quant( 2025-05-07T20:33:07.2847401Z self, 2025-05-07T20:33:07.2847767Z T: int, 2025-05-07T20:33:07.2847975Z D: int, 2025-05-07T20:33:07.2848202Z scale_ub: Optional[float], 2025-05-07T20:33:07.2848469Z contiguous: bool, 2025-05-07T20:33:07.2848712Z compiled: bool, 2025-05-07T20:33:07.2848941Z ) -> None: 2025-05-07T20:33:07.2849158Z torch.manual_seed(2025) 2025-05-07T20:33:07.2849396Z 2025-05-07T20:33:07.2849667Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2850009Z 2025-05-07T20:33:07.2850198Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2850488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2850805Z x = x_sign * x_clamp 2025-05-07T20:33:07.2851040Z x0 = x[:, :D] 2025-05-07T20:33:07.2851262Z x1 = x[:, D:] 2025-05-07T20:33:07.2851496Z 2025-05-07T20:33:07.2851701Z if contiguous: 2025-05-07T20:33:07.2851934Z x0 = x0.contiguous() 2025-05-07T20:33:07.2852207Z x1 = x1.contiguous() 2025-05-07T20:33:07.2852442Z 2025-05-07T20:33:07.2852633Z if scale_ub is not None: 2025-05-07T20:33:07.2852907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.2853240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.2853550Z ) 2025-05-07T20:33:07.2853742Z else: 2025-05-07T20:33:07.2853947Z scale_ub_tensor = None 2025-05-07T20:33:07.2854200Z 2025-05-07T20:33:07.2854432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.2854747Z op = silu_mul_quant 2025-05-07T20:33:07.2854996Z if compiled: 2025-05-07T20:33:07.2855336Z op = torch.compile(op) 2025-05-07T20:33:07.2855635Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2855904Z 2025-05-07T20:33:07.2856095Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.2856260Z 2025-05-07T20:33:07.2856364Z moe/activation_test.py:117: 2025-05-07T20:33:07.2856669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2857007Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.2857290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.2857847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.2858402Z return fn(*args, **kwargs) 2025-05-07T20:33:07.2859058Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.2859749Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.2860361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.2861046Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.2861708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.2862317Z kernel = self.compile( 2025-05-07T20:33:07.2862852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.2863506Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.2863905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.2864133Z 2025-05-07T20:33:07.2864336Z self = 2025-05-07T20:33:07.2865424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.2866816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916d390d0>} 2025-05-07T20:33:07.2868250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.2869273Z context = 2025-05-07T20:33:07.2869557Z 2025-05-07T20:33:07.2869725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.2870356Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.2870820Z module_map=module_map) 2025-05-07T20:33:07.2871197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.2871547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.2871810Z E ^ 2025-05-07T20:33:07.2872284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.2872739Z 2025-05-07T20:33:07.2873151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.2873666Z 2025-05-07T20:33:07.2873769Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2874183Z self=, 2025-05-07T20:33:07.2874581Z T=128, 2025-05-07T20:33:07.2874762Z D=7168, 2025-05-07T20:33:07.2874955Z scale_ub=1200.0, 2025-05-07T20:33:07.2875178Z contiguous=True, 2025-05-07T20:33:07.2875395Z compiled=False, 2025-05-07T20:33:07.2875607Z ) 2025-05-07T20:33:07.2875925Z self = 2025-05-07T20:33:07.2876469Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.2876749Z 2025-05-07T20:33:07.2876826Z @given( 2025-05-07T20:33:07.2877063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2877373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2877680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2878245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2878583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2878864Z ) 2025-05-07T20:33:07.2879215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2879653Z def test_silu_mul_quant( 2025-05-07T20:33:07.2879891Z self, 2025-05-07T20:33:07.2880090Z T: int, 2025-05-07T20:33:07.2880290Z D: int, 2025-05-07T20:33:07.2880501Z scale_ub: Optional[float], 2025-05-07T20:33:07.2880827Z contiguous: bool, 2025-05-07T20:33:07.2881067Z compiled: bool, 2025-05-07T20:33:07.2881288Z ) -> None: 2025-05-07T20:33:07.2881509Z torch.manual_seed(2025) 2025-05-07T20:33:07.2881752Z 2025-05-07T20:33:07.2882025Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2882462Z 2025-05-07T20:33:07.2882662Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2882955Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2885105Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
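Note how the OOM site moves here from the randn at moe/activation_test.py:92 to the clamp at line 95: the input tensor itself fit, but the out-of-place sign/abs/clamp steps each materialize another [T, 2*D] temporary. A sketch of an in-place rewrite that computes the same values with fewer temporaries; this is a possible memory-reduction tweak, not the test's actual code:

    # Sketch: same math as sign(x) * clamp(abs(x), 0.01, 2.0), fewer temporaries.
    import torch

    def clamp_magnitude_(x: torch.Tensor, lo: float = 0.01, hi: float = 2.0) -> torch.Tensor:
        x_sign = torch.sign(x)     # one temporary still needed for the sign
        x.abs_().clamp_(lo, hi)    # in-place abs and clamp, no new allocation
        return x.mul_(x_sign)      # in-place multiply restores the sign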
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2886969Z 2025-05-07T20:33:07.2887091Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2887309Z 2025-05-07T20:33:07.2887411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2887878Z self=, 2025-05-07T20:33:07.2888287Z T=128, 2025-05-07T20:33:07.2888476Z D=5120, 2025-05-07T20:33:07.2888671Z scale_ub=1200.0, 2025-05-07T20:33:07.2888897Z contiguous=True, 2025-05-07T20:33:07.2889115Z compiled=True, 2025-05-07T20:33:07.2889319Z ) 2025-05-07T20:33:07.2889637Z self = 2025-05-07T20:33:07.2890118Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:07.2890388Z 2025-05-07T20:33:07.2890464Z @given( 2025-05-07T20:33:07.2890699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.2891005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.2891315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.2891694Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.2892017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.2892308Z ) 2025-05-07T20:33:07.2892656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.2893092Z def test_silu_mul_quant( 2025-05-07T20:33:07.2893326Z self, 2025-05-07T20:33:07.2893520Z T: int, 2025-05-07T20:33:07.2893716Z D: int, 2025-05-07T20:33:07.2893926Z scale_ub: Optional[float], 2025-05-07T20:33:07.2894196Z contiguous: bool, 2025-05-07T20:33:07.2894436Z compiled: bool, 2025-05-07T20:33:07.2894651Z ) -> None: 2025-05-07T20:33:07.2894866Z torch.manual_seed(2025) 2025-05-07T20:33:07.2895110Z 2025-05-07T20:33:07.2895375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.2895780Z 2025-05-07T20:33:07.2896020Z x_sign = torch.sign(x) 2025-05-07T20:33:07.2896408Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.2898579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.2900460Z 2025-05-07T20:33:07.2900578Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:07.2900796Z 2025-05-07T20:33:07.2900960Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.2901380Z self=, 2025-05-07T20:33:07.2901827Z T=128, 2025-05-07T20:33:07.2902017Z D=7168, 2025-05-07T20:33:07.2902209Z scale_ub=None, 2025-05-07T20:33:07.2902418Z contiguous=True, 2025-05-07T20:33:07.2902693Z compiled=True, 2025-05-07T20:33:07.2902893Z ) 2025-05-07T20:33:07.5018193Z self = 2025-05-07T20:33:07.5018832Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5019101Z 2025-05-07T20:33:07.5019192Z @given( 2025-05-07T20:33:07.5019425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5019750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5020069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5020410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5020768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5021068Z ) 2025-05-07T20:33:07.5021429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5021872Z def test_silu_mul_quant( 2025-05-07T20:33:07.5022121Z self, 2025-05-07T20:33:07.5022326Z T: int, 2025-05-07T20:33:07.5022750Z D: int, 2025-05-07T20:33:07.5022979Z scale_ub: Optional[float], 2025-05-07T20:33:07.5023258Z contiguous: bool, 2025-05-07T20:33:07.5023496Z compiled: bool, 2025-05-07T20:33:07.5023730Z ) -> None: 2025-05-07T20:33:07.5023952Z torch.manual_seed(2025) 2025-05-07T20:33:07.5024196Z 2025-05-07T20:33:07.5024473Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5026573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
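Once the device is this full (4.44 MiB free), every later example fails regardless of its own size, so the remaining failures are cascading rather than independent. One workaround sketch, assuming the leak is dangling example tensors plus cached allocator blocks rather than a driver-level issue, is to reset CUDA memory between examples:

    # Sketch: free dangling tensors and cached allocator blocks between examples.
    import gc
    import torch

    def reset_cuda_memory() -> None:
        gc.collect()               # drop Python-side references first
        torch.cuda.empty_cache()   # release unoccupied cached blocks
        torch.cuda.synchronize()   # ensure pending work has completed

Calling reset_cuda_memory() at the top of the test body (or in setUp) keeps one example's allocations from starving the next.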
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5028454Z 2025-05-07T20:33:07.5028577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.5028795Z 2025-05-07T20:33:07.5040486Z FAILED 2025-05-07T20:33:07.5040683Z 2025-05-07T20:33:07.5040871Z =================================== FAILURES =================================== 2025-05-07T20:33:07.5041487Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:07.5042111Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:07.5055483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 59, in testPartExecutor 2025-05-07T20:33:07.5056545Z | yield 2025-05-07T20:33:07.5057150Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 592, in run 2025-05-07T20:33:07.5057888Z | self._callTestMethod(testMethod) 2025-05-07T20:33:07.5058696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/unittest/case.py", line 550, in _callTestMethod 2025-05-07T20:33:07.5059457Z | method() 2025-05-07T20:33:07.5060345Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:07.5061368Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5062273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:07.5063126Z | raise the_error_hypothesis_found 2025-05-07T20:33:07.5063931Z | exceptiongroup.ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:07.5064615Z +-+---------------- 1 ---------------- 2025-05-07T20:33:07.5065016Z | Traceback (most recent call last): 2025-05-07T20:33:07.5065993Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5067237Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5070276Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5073069Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5073674Z | self=, 2025-05-07T20:33:07.5074256Z | T=2048, 2025-05-07T20:33:07.5074577Z | D=5120, # or any other generated value 2025-05-07T20:33:07.5075139Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.5075644Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.5076131Z | compiled=False, # or any other generated value 2025-05-07T20:33:07.5076561Z | ) 2025-05-07T20:33:07.5076806Z | 2025-05-07T20:33:07.5077519Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:07.5078364Z +---------------- 2 ---------------- 2025-05-07T20:33:07.5078774Z | Traceback (most recent call last): 2025-05-07T20:33:07.5079761Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5080826Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5083676Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5086432Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5087043Z | self=, 2025-05-07T20:33:07.5087663Z | T=128, 2025-05-07T20:33:07.5088708Z | D=7168, 2025-05-07T20:33:07.5089006Z | scale_ub=None, 2025-05-07T20:33:07.5089267Z | contiguous=True, 2025-05-07T20:33:07.5089505Z | compiled=True, 2025-05-07T20:33:07.5089730Z | ) 2025-05-07T20:33:07.5089920Z | 2025-05-07T20:33:07.5090468Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5091076Z +---------------- 3 ---------------- 2025-05-07T20:33:07.5091370Z | Traceback (most recent call last): 2025-05-07T20:33:07.5092142Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:07.5092934Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5095362Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
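The base64 blobs in these notes are Hypothesis's serialized choice sequences; pinning one replays exactly that falsifying example. A sketch of how the suggested decorator is applied, with the version string and blob copied verbatim from failure 1 above; the decorator sits on top of the test's existing @given/@settings stack and is removed once the bug is fixed:

    # Sketch: replay one specific falsifying example from this run.
    from hypothesis import reproduce_failure

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    # @given(...) and @settings(...) exactly as in activation_test.py
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...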
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.5098318Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5098957Z | self=, 2025-05-07T20:33:07.5099559Z | T=128, 2025-05-07T20:33:07.5099842Z | D=5120, 2025-05-07T20:33:07.5100144Z | scale_ub=1200.0, 2025-05-07T20:33:07.5100397Z | contiguous=True, 2025-05-07T20:33:07.5100634Z | compiled=True, 2025-05-07T20:33:07.5100920Z | ) 2025-05-07T20:33:07.5101176Z | 2025-05-07T20:33:07.5101900Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5102723Z +---------------- 4 ---------------- 2025-05-07T20:33:07.5103116Z | Traceback (most recent call last): 2025-05-07T20:33:07.5104592Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:07.5105596Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5106510Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:07.5107481Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5108660Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:07.5109769Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5110712Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:07.5111735Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5112753Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:07.5113826Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5114945Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:33:07.5116078Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5117319Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:07.5118283Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5119199Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:07.5119988Z | fn() 2025-05-07T20:33:07.5120768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:07.5121655Z | self.fn.run( 2025-05-07T20:33:07.5122391Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:07.5123184Z | kernel = self.compile( 2025-05-07T20:33:07.5124110Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:07.5125087Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5126059Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:07.5127196Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5127993Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5128465Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5128816Z | ^ 2025-05-07T20:33:07.5129447Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5130217Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:07.5130774Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:07.5131495Z | self=, 2025-05-07T20:33:07.5132153Z | T=1, # or any other generated value 2025-05-07T20:33:07.5132580Z | D=5120, # or any other generated value 2025-05-07T20:33:07.5133046Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:07.5133529Z | contiguous=True, # or any other generated value 2025-05-07T20:33:07.5134089Z | compiled=True, # or any other generated value 2025-05-07T20:33:07.5134503Z | ) 2025-05-07T20:33:07.5134737Z | 2025-05-07T20:33:07.5135460Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:07.5136306Z +------------------------------------ 2025-05-07T20:33:07.5136808Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:07.5137328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5137917Z self=, 2025-05-07T20:33:07.5138493Z T=1, 2025-05-07T20:33:07.5138737Z D=5120, 2025-05-07T20:33:07.5139004Z scale_ub=None, 2025-05-07T20:33:07.5139294Z contiguous=True, 2025-05-07T20:33:07.5139609Z compiled=True, 2025-05-07T20:33:07.5139901Z ) 2025-05-07T20:33:07.5140346Z self = 2025-05-07T20:33:07.5140998Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5141364Z 2025-05-07T20:33:07.5141469Z @given( 2025-05-07T20:33:07.5141780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5142230Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5142651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5143110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5143564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5143950Z ) 2025-05-07T20:33:07.5144419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5145084Z def test_silu_mul_quant( 2025-05-07T20:33:07.5145404Z self, 2025-05-07T20:33:07.5145653Z T: int, 2025-05-07T20:33:07.5145911Z D: int, 2025-05-07T20:33:07.5146194Z scale_ub: Optional[float], 2025-05-07T20:33:07.5146545Z contiguous: bool, 2025-05-07T20:33:07.5146852Z compiled: bool, 2025-05-07T20:33:07.5147140Z ) -> None: 2025-05-07T20:33:07.5147413Z torch.manual_seed(2025) 2025-05-07T20:33:07.5147724Z 2025-05-07T20:33:07.5148076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5148515Z 2025-05-07T20:33:07.5148761Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5149132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5149536Z x = x_sign * x_clamp 2025-05-07T20:33:07.5149949Z x0 = x[:, :D] 2025-05-07T20:33:07.5150229Z x1 = x[:, D:] 2025-05-07T20:33:07.5150556Z 2025-05-07T20:33:07.5150803Z if contiguous: 2025-05-07T20:33:07.5151124Z x0 = x0.contiguous() 
2025-05-07T20:33:07.5151488Z x1 = x1.contiguous() 2025-05-07T20:33:07.5151833Z 2025-05-07T20:33:07.5152083Z if scale_ub is not None: 2025-05-07T20:33:07.5152448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5153332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5153733Z ) 2025-05-07T20:33:07.5153976Z else: 2025-05-07T20:33:07.5154237Z scale_ub_tensor = None 2025-05-07T20:33:07.5154566Z 2025-05-07T20:33:07.5154861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5155260Z op = silu_mul_quant 2025-05-07T20:33:07.5155583Z if compiled: 2025-05-07T20:33:07.5155915Z op = torch.compile(op) 2025-05-07T20:33:07.5156312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5156704Z 2025-05-07T20:33:07.5156971Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5157363Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5157765Z 2025-05-07T20:33:07.5158053Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5158484Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5158896Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5159336Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5159836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5160251Z 2025-05-07T20:33:07.5160513Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5160768Z 2025-05-07T20:33:07.5160903Z moe/activation_test.py:126: 2025-05-07T20:33:07.5161289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5161743Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5162196Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5163229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5164213Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5164926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5165817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5166724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5167669Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5168662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5169650Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5170694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5171528Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5172326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5173018Z fn() 2025-05-07T20:33:07.5173692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5174490Z self.fn.run( 2025-05-07T20:33:07.5175124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5175834Z kernel = self.compile( 2025-05-07T20:33:07.5176573Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5177547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5178118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5178437Z 2025-05-07T20:33:07.5178721Z self = 2025-05-07T20:33:07.5180271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5182286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891b3dd9d0>} 2025-05-07T20:33:07.5184175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5185540Z context = 2025-05-07T20:33:07.5185919Z 2025-05-07T20:33:07.5186132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5186812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5187477Z module_map=module_map) 2025-05-07T20:33:07.5187948Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5188397Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5188742Z E ^ 2025-05-07T20:33:07.5189344Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5190046Z 2025-05-07T20:33:07.5190614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5191293Z 2025-05-07T20:33:07.5191429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5192002Z self=, 2025-05-07T20:33:07.5192538Z T=2048, 2025-05-07T20:33:07.5192770Z D=5120, 2025-05-07T20:33:07.5193017Z scale_ub=1200.0, 2025-05-07T20:33:07.5193303Z contiguous=True, 2025-05-07T20:33:07.5193582Z compiled=False, 2025-05-07T20:33:07.5193856Z ) 2025-05-07T20:33:07.5194276Z self = 2025-05-07T20:33:07.5194926Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5195300Z 2025-05-07T20:33:07.5195399Z @given( 2025-05-07T20:33:07.5195702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5196110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5196515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5196962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5197416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5197876Z ) 2025-05-07T20:33:07.5198363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5198966Z def test_silu_mul_quant( 2025-05-07T20:33:07.5199282Z self, 2025-05-07T20:33:07.5199540Z T: int, 2025-05-07T20:33:07.5199806Z D: int, 2025-05-07T20:33:07.5200088Z scale_ub: Optional[float], 2025-05-07T20:33:07.5200457Z contiguous: bool, 2025-05-07T20:33:07.5200787Z compiled: bool, 2025-05-07T20:33:07.5201081Z ) -> None: 2025-05-07T20:33:07.5201356Z torch.manual_seed(2025) 2025-05-07T20:33:07.5201672Z 2025-05-07T20:33:07.5202024Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5202459Z 2025-05-07T20:33:07.5202705Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5203079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5203469Z x = x_sign * x_clamp 2025-05-07T20:33:07.5204154Z x0 = x[:, :D] 
2025-05-07T20:33:07.5204450Z x1 = x[:, D:] 2025-05-07T20:33:07.5204712Z 2025-05-07T20:33:07.5204948Z if contiguous: 2025-05-07T20:33:07.5205243Z x0 = x0.contiguous() 2025-05-07T20:33:07.5205573Z x1 = x1.contiguous() 2025-05-07T20:33:07.5205962Z 2025-05-07T20:33:07.5206216Z if scale_ub is not None: 2025-05-07T20:33:07.5206563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5207015Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5207414Z ) 2025-05-07T20:33:07.5207661Z else: 2025-05-07T20:33:07.5207941Z scale_ub_tensor = None 2025-05-07T20:33:07.5208278Z 2025-05-07T20:33:07.5208600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5209029Z op = silu_mul_quant 2025-05-07T20:33:07.5209379Z if compiled: 2025-05-07T20:33:07.5209709Z op = torch.compile(op) 2025-05-07T20:33:07.5210109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5210473Z 2025-05-07T20:33:07.5210728Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5210946Z 2025-05-07T20:33:07.5211078Z moe/activation_test.py:117: 2025-05-07T20:33:07.5211562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5212017Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5212380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5213301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5214226Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5214945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5215878Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5216769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5217504Z kernel = self.compile( 2025-05-07T20:33:07.5218250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5219147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5219689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5220006Z 2025-05-07T20:33:07.5220289Z self = 2025-05-07T20:33:07.5221837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5223791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88f9ced5e0>} 2025-05-07T20:33:07.5225766Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5227204Z context = 2025-05-07T20:33:07.5227604Z 2025-05-07T20:33:07.5227841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5228564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5229204Z module_map=module_map) 2025-05-07T20:33:07.5229694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5230282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5230626Z E ^ 2025-05-07T20:33:07.5231319Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:07.5233383Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:07.5234009Z     self=,
2025-05-07T20:33:07.5234558Z     T=2048,
2025-05-07T20:33:07.5234811Z     D=5120,
2025-05-07T20:33:07.5235063Z     scale_ub=1200.0,
2025-05-07T20:33:07.5235376Z     contiguous=True,
2025-05-07T20:33:07.5235698Z     compiled=True,
2025-05-07T20:33:07.5235968Z )
2025-05-07T20:33:07.5236404Z self =
2025-05-07T20:33:07.5237087Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:07.5237469Z
2025-05-07T20:33:07.5237574Z     @given(
2025-05-07T20:33:07.5237892Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:07.5238319Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:07.5238730Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:07.5239184Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:07.5239692Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:07.5240084Z     )
2025-05-07T20:33:07.5240544Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:07.5241143Z     def test_silu_mul_quant(
2025-05-07T20:33:07.5241457Z         self,
2025-05-07T20:33:07.5241709Z         T: int,
2025-05-07T20:33:07.5241959Z         D: int,
2025-05-07T20:33:07.5242268Z         scale_ub: Optional[float],
2025-05-07T20:33:07.5242629Z         contiguous: bool,
2025-05-07T20:33:07.5262631Z         compiled: bool,
2025-05-07T20:33:07.5262879Z     ) -> None:
2025-05-07T20:33:07.5263102Z         torch.manual_seed(2025)
2025-05-07T20:33:07.5263416Z
2025-05-07T20:33:07.5263758Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:07.5264111Z
2025-05-07T20:33:07.5264310Z         x_sign = torch.sign(x)
2025-05-07T20:33:07.5264601Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:07.5264918Z         x = x_sign * x_clamp
2025-05-07T20:33:07.5265173Z         x0 = x[:, :D]
2025-05-07T20:33:07.5265394Z         x1 = x[:, D:]
2025-05-07T20:33:07.5265605Z
2025-05-07T20:33:07.5265796Z         if contiguous:
2025-05-07T20:33:07.5266025Z             x0 = x0.contiguous()
2025-05-07T20:33:07.5266292Z             x1 = x1.contiguous()
2025-05-07T20:33:07.5266534Z
2025-05-07T20:33:07.5266721Z         if scale_ub is not None:
2025-05-07T20:33:07.5266999Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:07.5267336Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:07.5267649Z             )
2025-05-07T20:33:07.5267837Z         else:
2025-05-07T20:33:07.5268199Z             scale_ub_tensor = None
2025-05-07T20:33:07.5268452Z
2025-05-07T20:33:07.5268679Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5268996Z             op = silu_mul_quant
2025-05-07T20:33:07.5269245Z             if compiled:
2025-05-07T20:33:07.5269488Z                 op = torch.compile(op)
2025-05-07T20:33:07.5269866Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:07.5270157Z
2025-05-07T20:33:07.5270343Z         y_fp8, y_scale = fn()
2025-05-07T20:33:07.5270629Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:07.5270923Z
2025-05-07T20:33:07.5271157Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5271497Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:07.5271791Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:07.5272105Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:07.5272460Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5272834Z
2025-05-07T20:33:07.5273037Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.5273233Z
2025-05-07T20:33:07.5273334Z moe/activation_test.py:126:
2025-05-07T20:33:07.5273638Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:07.5274026Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:07.5274347Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5275139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:07.5275907Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:07.5276452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:07.5277124Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:07.5277813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:07.5278613Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5280525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:07.5281395Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5282184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:07.5282824Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:07.5283435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:07.5283958Z     fn()
2025-05-07T20:33:07.5284467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:07.5285055Z     self.fn.run(
2025-05-07T20:33:07.5285518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:07.5286053Z     kernel = self.compile(
2025-05-07T20:33:07.5286607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:07.5287275Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:07.5287680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:07.5287925Z
2025-05-07T20:33:07.5288135Z self =
2025-05-07T20:33:07.5289238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:07.5290706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919e54160>}
2025-05-07T20:33:07.5292112Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:07.5293143Z context =
2025-05-07T20:33:07.5293433Z
2025-05-07T20:33:07.5293598Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:07.5294120Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:07.5294583Z                            module_map=module_map)
2025-05-07T20:33:07.5294950Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5295306Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.5295619Z E       ^
2025-05-07T20:33:07.5296087Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.5296547Z
2025-05-07T20:33:07.5296967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.5297521Z
at 0x7f8919e54160>} 2025-05-07T20:33:07.5292112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5293143Z context = 2025-05-07T20:33:07.5293433Z 2025-05-07T20:33:07.5293598Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5294120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5294583Z module_map=module_map) 2025-05-07T20:33:07.5294950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5295306Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5295619Z E ^ 2025-05-07T20:33:07.5296087Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5296547Z 2025-05-07T20:33:07.5296967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5297521Z 2025-05-07T20:33:07.5297635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5298043Z self=, 2025-05-07T20:33:07.5298452Z T=16384, 2025-05-07T20:33:07.5298645Z D=7168, 2025-05-07T20:33:07.5298832Z scale_ub=1200.0, 2025-05-07T20:33:07.5299060Z contiguous=False, 2025-05-07T20:33:07.5299293Z compiled=False, 2025-05-07T20:33:07.5299493Z ) 2025-05-07T20:33:07.5299815Z self = 2025-05-07T20:33:07.5300321Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5300608Z 2025-05-07T20:33:07.5300692Z @given( 2025-05-07T20:33:07.5300917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5301237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5301573Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5301980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5302322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5302610Z ) 2025-05-07T20:33:07.5302960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5303407Z def test_silu_mul_quant( 2025-05-07T20:33:07.5303649Z self, 2025-05-07T20:33:07.5304125Z T: int, 2025-05-07T20:33:07.5304318Z D: int, 2025-05-07T20:33:07.5304534Z scale_ub: Optional[float], 2025-05-07T20:33:07.5304806Z contiguous: bool, 2025-05-07T20:33:07.5305042Z compiled: bool, 2025-05-07T20:33:07.5305276Z ) -> None: 2025-05-07T20:33:07.5305495Z torch.manual_seed(2025) 2025-05-07T20:33:07.5305734Z 2025-05-07T20:33:07.5306010Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5306357Z 2025-05-07T20:33:07.5306550Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5306855Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5307166Z x = x_sign * x_clamp 2025-05-07T20:33:07.5307399Z x0 = x[:, :D] 2025-05-07T20:33:07.5307623Z x1 = x[:, D:] 2025-05-07T20:33:07.5307840Z 2025-05-07T20:33:07.5308020Z if contiguous: 2025-05-07T20:33:07.5308252Z x0 = x0.contiguous() 2025-05-07T20:33:07.5308519Z x1 = x1.contiguous() 2025-05-07T20:33:07.5308756Z 2025-05-07T20:33:07.5308948Z if scale_ub is not None: 2025-05-07T20:33:07.5309221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5309556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5310077Z ) 2025-05-07T20:33:07.5310270Z else: 2025-05-07T20:33:07.5310482Z scale_ub_tensor = None 2025-05-07T20:33:07.5310726Z 2025-05-07T20:33:07.5310958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5311272Z op = silu_mul_quant 2025-05-07T20:33:07.5311523Z if compiled: 
2025-05-07T20:33:07.5311769Z op = torch.compile(op) 2025-05-07T20:33:07.5312065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5312332Z 2025-05-07T20:33:07.5312523Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5312688Z 2025-05-07T20:33:07.5312793Z moe/activation_test.py:117: 2025-05-07T20:33:07.5313079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5313408Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5313687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5314476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5315171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5315708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5316390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5317117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5317644Z kernel = self.compile( 2025-05-07T20:33:07.5318186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5318840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5319230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5319463Z 2025-05-07T20:33:07.5319670Z self = 2025-05-07T20:33:07.5320759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5322213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919dffe50>} 2025-05-07T20:33:07.5323565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5324581Z context = 2025-05-07T20:33:07.5324870Z 2025-05-07T20:33:07.5325037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5325564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5326037Z module_map=module_map) 2025-05-07T20:33:07.5326396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5326749Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5327013Z E ^ 2025-05-07T20:33:07.5327476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5327934Z 2025-05-07T20:33:07.5328357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5328877Z 2025-05-07T20:33:07.5328981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5329399Z self=, 2025-05-07T20:33:07.5329793Z T=1, 2025-05-07T20:33:07.5329973Z D=7168, 2025-05-07T20:33:07.5330216Z scale_ub=None, 2025-05-07T20:33:07.5330424Z contiguous=True, 2025-05-07T20:33:07.5330645Z compiled=True, 2025-05-07T20:33:07.5330844Z ) 2025-05-07T20:33:07.5331156Z self = 2025-05-07T20:33:07.5331659Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5331948Z 2025-05-07T20:33:07.5332029Z @given( 2025-05-07T20:33:07.5332254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5332564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5332868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5333197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5333521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5333804Z ) 2025-05-07T20:33:07.5334148Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5334581Z def test_silu_mul_quant( 2025-05-07T20:33:07.5334871Z self, 2025-05-07T20:33:07.5335066Z T: int, 2025-05-07T20:33:07.5335255Z D: int, 2025-05-07T20:33:07.5335468Z scale_ub: Optional[float], 2025-05-07T20:33:07.5335738Z contiguous: bool, 2025-05-07T20:33:07.5335969Z compiled: bool, 2025-05-07T20:33:07.5336190Z ) -> None: 2025-05-07T20:33:07.5336454Z torch.manual_seed(2025) 2025-05-07T20:33:07.5336690Z 2025-05-07T20:33:07.5336956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5337294Z 2025-05-07T20:33:07.5337485Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5337769Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5338076Z x = x_sign * x_clamp 2025-05-07T20:33:07.5338320Z x0 = x[:, :D] 2025-05-07T20:33:07.5338530Z x1 = x[:, D:] 2025-05-07T20:33:07.5338734Z 2025-05-07T20:33:07.5338924Z if contiguous: 2025-05-07T20:33:07.5339150Z x0 = x0.contiguous() 2025-05-07T20:33:07.5339417Z x1 = x1.contiguous() 2025-05-07T20:33:07.5339659Z 2025-05-07T20:33:07.5339845Z if scale_ub is not None: 2025-05-07T20:33:07.5340121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5340461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5340843Z ) 2025-05-07T20:33:07.5341042Z else: 2025-05-07T20:33:07.5341254Z scale_ub_tensor = None 2025-05-07T20:33:07.5341503Z 2025-05-07T20:33:07.5341772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5342099Z op = silu_mul_quant 2025-05-07T20:33:07.5342346Z if compiled: 2025-05-07T20:33:07.5342597Z op = torch.compile(op) 2025-05-07T20:33:07.5343171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5343525Z 2025-05-07T20:33:07.5343836Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5344262Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5344631Z 2025-05-07T20:33:07.5344975Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5345437Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5345799Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5346228Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5346718Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5347155Z 2025-05-07T20:33:07.5347408Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5347657Z 2025-05-07T20:33:07.5347838Z moe/activation_test.py:126: 2025-05-07T20:33:07.5348263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5348650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5349121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5350239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5351293Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5352105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5353140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5354116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5355090Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5355899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5356738Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5357706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5358435Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5359089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5359773Z fn() 2025-05-07T20:33:07.5360408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5361063Z self.fn.run( 2025-05-07T20:33:07.5361716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5362349Z kernel = self.compile( 2025-05-07T20:33:07.5362989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5363771Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5364262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5364579Z 2025-05-07T20:33:07.5364800Z self = 2025-05-07T20:33:07.5366108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5367582Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8919dff550>} 2025-05-07T20:33:07.5369019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5370220Z context = 2025-05-07T20:33:07.5370533Z 2025-05-07T20:33:07.5370793Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5371445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5371979Z module_map=module_map) 2025-05-07T20:33:07.5372451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5372941Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5373275Z E ^ 2025-05-07T20:33:07.5373842Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5374341Z 2025-05-07T20:33:07.5374855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5375411Z 2025-05-07T20:33:07.5375592Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5376140Z self=, 2025-05-07T20:33:07.5376743Z T=4096, 2025-05-07T20:33:07.5377065Z D=5120, 2025-05-07T20:33:07.5377340Z scale_ub=None, 2025-05-07T20:33:07.5377662Z contiguous=False, 2025-05-07T20:33:07.5378021Z compiled=False, 2025-05-07T20:33:07.5378325Z ) 2025-05-07T20:33:07.5378741Z self = 2025-05-07T20:33:07.5379371Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5379674Z 2025-05-07T20:33:07.5379815Z @given( 2025-05-07T20:33:07.5380159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5380578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5381050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5381555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5381972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5382369Z ) 2025-05-07T20:33:07.5382933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5383439Z def test_silu_mul_quant( 2025-05-07T20:33:07.5383799Z self, 2025-05-07T20:33:07.5384148Z T: int, 2025-05-07T20:33:07.5384392Z D: int, 2025-05-07T20:33:07.5384719Z scale_ub: Optional[float], 2025-05-07T20:33:07.5385204Z contiguous: bool, 2025-05-07T20:33:07.5385623Z compiled: bool, 2025-05-07T20:33:07.5385895Z ) -> None: 2025-05-07T20:33:07.5386261Z torch.manual_seed(2025) 2025-05-07T20:33:07.5386609Z 2025-05-07T20:33:07.5386938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5387434Z 2025-05-07T20:33:07.5387733Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5388080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5388536Z x = x_sign * x_clamp 2025-05-07T20:33:07.5388888Z x0 = x[:, :D] 2025-05-07T20:33:07.5389153Z x1 = x[:, D:] 2025-05-07T20:33:07.5389506Z 2025-05-07T20:33:07.5389933Z if contiguous: 2025-05-07T20:33:07.5390234Z x0 = x0.contiguous() 2025-05-07T20:33:07.5390629Z x1 = x1.contiguous() 2025-05-07T20:33:07.5390979Z 2025-05-07T20:33:07.5391245Z if scale_ub is not None: 2025-05-07T20:33:07.5391663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5392194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5392579Z ) 2025-05-07T20:33:07.5392918Z else: 2025-05-07T20:33:07.5393218Z scale_ub_tensor = None 2025-05-07T20:33:07.5393547Z 2025-05-07T20:33:07.5393998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5394402Z op = silu_mul_quant 2025-05-07T20:33:07.5394724Z if compiled: 
2025-05-07T20:33:07.5395164Z op = torch.compile(op) 2025-05-07T20:33:07.5395517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5395888Z 2025-05-07T20:33:07.5396274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5396468Z 2025-05-07T20:33:07.5396594Z moe/activation_test.py:117: 2025-05-07T20:33:07.5396984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5397478Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5397853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5398622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5399561Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5400210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5400931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5401806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5402456Z kernel = self.compile( 2025-05-07T20:33:07.5403181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5404159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5404682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5404962Z 2025-05-07T20:33:07.5405268Z self = 2025-05-07T20:33:07.5406495Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5407941Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8919a23940>} 2025-05-07T20:33:07.5409591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5410756Z context = 2025-05-07T20:33:07.5411108Z 2025-05-07T20:33:07.5411375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5412141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5412664Z module_map=module_map) 2025-05-07T20:33:07.5413115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5413630Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5413947Z E ^ 2025-05-07T20:33:07.5414501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5415076Z 2025-05-07T20:33:07.5415572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5416116Z 2025-05-07T20:33:07.5416280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5416816Z self=, 2025-05-07T20:33:07.5417460Z T=4096, 2025-05-07T20:33:07.5417745Z D=7168, 2025-05-07T20:33:07.5418028Z scale_ub=None, 2025-05-07T20:33:07.5418369Z contiguous=False, 2025-05-07T20:33:07.5418691Z compiled=False, 2025-05-07T20:33:07.5418987Z ) 2025-05-07T20:33:07.5419426Z self = 2025-05-07T20:33:07.5420011Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5420338Z 2025-05-07T20:33:07.5420529Z @given( 2025-05-07T20:33:07.5420925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5421336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5421820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5422381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5422847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5423354Z ) 2025-05-07T20:33:07.5423949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5424554Z def test_silu_mul_quant( 2025-05-07T20:33:07.5424876Z self, 2025-05-07T20:33:07.5425276Z T: int, 2025-05-07T20:33:07.5425546Z D: int, 2025-05-07T20:33:07.5425895Z scale_ub: Optional[float], 2025-05-07T20:33:07.5426375Z contiguous: bool, 2025-05-07T20:33:07.5426795Z compiled: bool, 2025-05-07T20:33:07.5427113Z ) -> None: 2025-05-07T20:33:07.5427623Z torch.manual_seed(2025) 2025-05-07T20:33:07.5428019Z 2025-05-07T20:33:07.5428381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5429031Z 2025-05-07T20:33:07.5429315Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5429639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5430211Z x = x_sign * x_clamp 2025-05-07T20:33:07.5430544Z x0 = x[:, :D] 2025-05-07T20:33:07.5430902Z x1 = x[:, D:] 2025-05-07T20:33:07.5431205Z 2025-05-07T20:33:07.5431583Z if contiguous: 2025-05-07T20:33:07.5431975Z x0 = x0.contiguous() 2025-05-07T20:33:07.5432321Z x1 = x1.contiguous() 2025-05-07T20:33:07.5432652Z 2025-05-07T20:33:07.5432979Z if scale_ub is not None: 2025-05-07T20:33:07.5433319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5433746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5439767Z ) 2025-05-07T20:33:07.5439999Z else: 2025-05-07T20:33:07.5440221Z scale_ub_tensor = None 2025-05-07T20:33:07.5440474Z 2025-05-07T20:33:07.5440794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5441128Z op = silu_mul_quant 2025-05-07T20:33:07.5441382Z if compiled: 2025-05-07T20:33:07.5441635Z op = torch.compile(op) 2025-05-07T20:33:07.5441990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5442260Z 2025-05-07T20:33:07.5442505Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5442671Z 2025-05-07T20:33:07.5442775Z moe/activation_test.py:117: 2025-05-07T20:33:07.5443075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5443414Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5443699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5444403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5445097Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5445643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5446333Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5447004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5447535Z kernel = self.compile( 2025-05-07T20:33:07.5448137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5448791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5449184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5449417Z 2025-05-07T20:33:07.5449621Z self = 2025-05-07T20:33:07.5450717Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5452119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89199e65e0>} 2025-05-07T20:33:07.5453480Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5454511Z context = 2025-05-07T20:33:07.5454806Z 2025-05-07T20:33:07.5454968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5455494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5455964Z module_map=module_map) 2025-05-07T20:33:07.5456328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5456728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5456983Z E ^ 2025-05-07T20:33:07.5457451Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5457911Z 2025-05-07T20:33:07.5458334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5458854Z 2025-05-07T20:33:07.5458955Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5459374Z self=, 2025-05-07T20:33:07.5459774Z T=128, 2025-05-07T20:33:07.5459960Z D=7168, 2025-05-07T20:33:07.5460155Z scale_ub=None, 2025-05-07T20:33:07.5460371Z contiguous=False, 2025-05-07T20:33:07.5460596Z compiled=True, 2025-05-07T20:33:07.5460797Z ) 2025-05-07T20:33:07.5461117Z self = 2025-05-07T20:33:07.5461655Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5461923Z 2025-05-07T20:33:07.5462010Z @given( 2025-05-07T20:33:07.5462237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5462549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5462907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5463238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5463556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5463838Z ) 2025-05-07T20:33:07.5464197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5464631Z def test_silu_mul_quant( 2025-05-07T20:33:07.5464866Z self, 2025-05-07T20:33:07.5465051Z T: int, 2025-05-07T20:33:07.5465236Z D: int, 2025-05-07T20:33:07.5465449Z scale_ub: Optional[float], 2025-05-07T20:33:07.5465721Z contiguous: bool, 2025-05-07T20:33:07.5465961Z compiled: bool, 2025-05-07T20:33:07.5466173Z ) -> None: 2025-05-07T20:33:07.5466387Z torch.manual_seed(2025) 2025-05-07T20:33:07.5466625Z 2025-05-07T20:33:07.5466892Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5467232Z 2025-05-07T20:33:07.5467509Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5467793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5468102Z x = x_sign * x_clamp 2025-05-07T20:33:07.5468338Z x0 = x[:, :D] 2025-05-07T20:33:07.5468543Z x1 = x[:, D:] 2025-05-07T20:33:07.5468744Z 2025-05-07T20:33:07.5468922Z if contiguous: 2025-05-07T20:33:07.5469141Z x0 = x0.contiguous() 2025-05-07T20:33:07.5469392Z x1 = x1.contiguous() 2025-05-07T20:33:07.5469627Z 2025-05-07T20:33:07.5469874Z if scale_ub is not None: 2025-05-07T20:33:07.5470146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5470487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5470785Z ) 2025-05-07T20:33:07.5470973Z else: 2025-05-07T20:33:07.5471177Z scale_ub_tensor = None 2025-05-07T20:33:07.5471415Z 2025-05-07T20:33:07.5471649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5471962Z op = silu_mul_quant 2025-05-07T20:33:07.5472202Z if compiled: 2025-05-07T20:33:07.5472451Z op = torch.compile(op) 2025-05-07T20:33:07.5472745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5473014Z 2025-05-07T20:33:07.5473196Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5473478Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5473763Z 2025-05-07T20:33:07.5473985Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5474311Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5474599Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5474966Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5475317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5475624Z 2025-05-07T20:33:07.5475810Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5476011Z 2025-05-07T20:33:07.5476111Z moe/activation_test.py:126: 2025-05-07T20:33:07.5476403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5476729Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5477044Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5477840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5478605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5479186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5479865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5480556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5481274Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5482109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5483029Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5483757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5484398Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5484999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5485518Z fn() 2025-05-07T20:33:07.5486017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5486590Z self.fn.run( 2025-05-07T20:33:07.5487100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5487628Z kernel = self.compile( 2025-05-07T20:33:07.5488163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5488810Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5489215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5489439Z 2025-05-07T20:33:07.5489644Z self = 2025-05-07T20:33:07.5490733Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5492231Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8919e54310>} 2025-05-07T20:33:07.5493587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5494661Z context = 2025-05-07T20:33:07.5494944Z 2025-05-07T20:33:07.5495155Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5495897Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5496562Z module_map=module_map) 2025-05-07T20:33:07.5497065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5497566Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5497917Z E ^ 2025-05-07T20:33:07.5498518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5499113Z 2025-05-07T20:33:07.5499664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5500374Z 2025-05-07T20:33:07.5500524Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5501047Z self=, 2025-05-07T20:33:07.5501611Z T=128, 2025-05-07T20:33:07.5501873Z D=7168, 2025-05-07T20:33:07.5502082Z scale_ub=None, 2025-05-07T20:33:07.5502385Z contiguous=False, 2025-05-07T20:33:07.5502694Z compiled=False, 2025-05-07T20:33:07.5502979Z ) 2025-05-07T20:33:07.5503517Z self = 2025-05-07T20:33:07.5504411Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5504784Z 2025-05-07T20:33:07.5504888Z @given( 2025-05-07T20:33:07.5505208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5505776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5506192Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5506592Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5506916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5507200Z ) 2025-05-07T20:33:07.5507541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5507986Z def test_silu_mul_quant( 2025-05-07T20:33:07.5508226Z self, 2025-05-07T20:33:07.5508409Z T: int, 2025-05-07T20:33:07.5508602Z D: int, 2025-05-07T20:33:07.5508820Z scale_ub: Optional[float], 2025-05-07T20:33:07.5509081Z contiguous: bool, 2025-05-07T20:33:07.5509314Z compiled: bool, 2025-05-07T20:33:07.5509534Z ) -> None: 2025-05-07T20:33:07.5509739Z torch.manual_seed(2025) 2025-05-07T20:33:07.5510047Z 2025-05-07T20:33:07.5510404Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5510753Z 2025-05-07T20:33:07.5510942Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5511227Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5511540Z x = x_sign * x_clamp 2025-05-07T20:33:07.5511770Z x0 = x[:, :D] 2025-05-07T20:33:07.5511982Z x1 = x[:, D:] 2025-05-07T20:33:07.5512189Z 2025-05-07T20:33:07.5512362Z if contiguous: 2025-05-07T20:33:07.5512588Z x0 = x0.contiguous() 2025-05-07T20:33:07.5512842Z x1 = x1.contiguous() 2025-05-07T20:33:07.5513072Z 2025-05-07T20:33:07.5513261Z if scale_ub is not None: 2025-05-07T20:33:07.5513530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5513857Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5514163Z ) 2025-05-07T20:33:07.5514350Z else: 2025-05-07T20:33:07.5514552Z scale_ub_tensor = None 2025-05-07T20:33:07.5514804Z 2025-05-07T20:33:07.5515035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5515336Z op = silu_mul_quant 2025-05-07T20:33:07.5515590Z if compiled: 
2025-05-07T20:33:07.5515837Z op = torch.compile(op) 2025-05-07T20:33:07.5516125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5516410Z 2025-05-07T20:33:07.5516599Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5516761Z 2025-05-07T20:33:07.5516860Z moe/activation_test.py:117: 2025-05-07T20:33:07.5517154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5517564Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5517835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5518526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5519219Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5519763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5520436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5521093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5521621Z kernel = self.compile( 2025-05-07T20:33:07.5522159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5522880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5523278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5523503Z 2025-05-07T20:33:07.5523712Z self = 2025-05-07T20:33:07.5524797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5526237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891957e160>} 2025-05-07T20:33:07.5527590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5528620Z context = 2025-05-07T20:33:07.5528906Z 2025-05-07T20:33:07.5529081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5529597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5530106Z module_map=module_map) 2025-05-07T20:33:07.5530477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5530825Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5531080Z E ^ 2025-05-07T20:33:07.5531537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5531989Z 2025-05-07T20:33:07.5532407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5532918Z 2025-05-07T20:33:07.5533018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5533432Z self=, 2025-05-07T20:33:07.5533827Z T=4096, 2025-05-07T20:33:07.5534005Z D=5120, 2025-05-07T20:33:07.5534192Z scale_ub=1200.0, 2025-05-07T20:33:07.5534409Z contiguous=True, 2025-05-07T20:33:07.5534638Z compiled=False, 2025-05-07T20:33:07.5534845Z ) 2025-05-07T20:33:07.5535161Z self = 2025-05-07T20:33:07.5535749Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.5536027Z 2025-05-07T20:33:07.5536102Z @given( 2025-05-07T20:33:07.5536324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5536633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5536929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5537255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5537580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5537924Z ) 2025-05-07T20:33:07.5538273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5538715Z def test_silu_mul_quant( 2025-05-07T20:33:07.5538957Z self, 2025-05-07T20:33:07.5539146Z T: int, 2025-05-07T20:33:07.5539344Z D: int, 2025-05-07T20:33:07.5539567Z scale_ub: Optional[float], 2025-05-07T20:33:07.5539834Z contiguous: bool, 2025-05-07T20:33:07.5540075Z compiled: bool, 2025-05-07T20:33:07.5540298Z ) -> None: 2025-05-07T20:33:07.5540509Z torch.manual_seed(2025) 2025-05-07T20:33:07.5540750Z 2025-05-07T20:33:07.5541018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5541359Z 2025-05-07T20:33:07.5541558Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5541848Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5542152Z x = x_sign * x_clamp 2025-05-07T20:33:07.5542450Z x0 = x[:, :D] 2025-05-07T20:33:07.5542667Z x1 = x[:, D:] 2025-05-07T20:33:07.5542868Z 2025-05-07T20:33:07.5543060Z if contiguous: 2025-05-07T20:33:07.5543288Z x0 = x0.contiguous() 2025-05-07T20:33:07.5543541Z x1 = x1.contiguous() 2025-05-07T20:33:07.5543782Z 2025-05-07T20:33:07.5544028Z if scale_ub is not None: 2025-05-07T20:33:07.5544302Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5544631Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5544940Z ) 2025-05-07T20:33:07.5545133Z else: 2025-05-07T20:33:07.5545342Z scale_ub_tensor = None 2025-05-07T20:33:07.5545595Z 2025-05-07T20:33:07.5545823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5546125Z op = silu_mul_quant 2025-05-07T20:33:07.5546376Z if compiled: 2025-05-07T20:33:07.5546623Z op = torch.compile(op) 2025-05-07T20:33:07.5546919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5547193Z 2025-05-07T20:33:07.5547379Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5547541Z 2025-05-07T20:33:07.5547640Z moe/activation_test.py:117: 2025-05-07T20:33:07.5547933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5548385Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5548663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5549342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5550095Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5550627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5551300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5552009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5552536Z kernel = self.compile( 2025-05-07T20:33:07.5552912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5553094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5553220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5553225Z 2025-05-07T20:33:07.5553426Z self = 2025-05-07T20:33:07.5554213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5554723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89194f75e0>} 2025-05-07T20:33:07.5555526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5555716Z context = 2025-05-07T20:33:07.5555724Z 2025-05-07T20:33:07.5555894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5556155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5556260Z module_map=module_map) 2025-05-07T20:33:07.5556429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5556523Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5556596Z E ^ 2025-05-07T20:33:07.5556993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5557000Z 2025-05-07T20:33:07.5557413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5557418Z 2025-05-07T20:33:07.5557520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5557791Z self=, 2025-05-07T20:33:07.5557868Z T=1, 2025-05-07T20:33:07.5557949Z D=5120, 2025-05-07T20:33:07.5558028Z scale_ub=None, 2025-05-07T20:33:07.5558110Z contiguous=True, 2025-05-07T20:33:07.5558195Z compiled=True, 2025-05-07T20:33:07.5558263Z ) 2025-05-07T20:33:07.5558489Z self = 2025-05-07T20:33:07.5558651Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5558656Z 2025-05-07T20:33:07.5558729Z @given( 2025-05-07T20:33:07.5558860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5558960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5559070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5559191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5559301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5559417Z ) 2025-05-07T20:33:07.5559667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5559759Z def test_silu_mul_quant( 2025-05-07T20:33:07.5559841Z self, 2025-05-07T20:33:07.5559914Z T: int, 2025-05-07T20:33:07.5559986Z D: int, 2025-05-07T20:33:07.5560085Z scale_ub: Optional[float], 2025-05-07T20:33:07.5560171Z contiguous: bool, 2025-05-07T20:33:07.5560255Z compiled: bool, 2025-05-07T20:33:07.5560335Z ) -> None: 2025-05-07T20:33:07.5560433Z torch.manual_seed(2025) 2025-05-07T20:33:07.5560501Z 2025-05-07T20:33:07.5560677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5560753Z 2025-05-07T20:33:07.5560840Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5560969Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5561055Z x = x_sign * x_clamp 2025-05-07T20:33:07.5561141Z x0 = x[:, :D] 2025-05-07T20:33:07.5561225Z x1 = x[:, D:] 2025-05-07T20:33:07.5561294Z 2025-05-07T20:33:07.5561380Z if contiguous: 2025-05-07T20:33:07.5561469Z x0 = x0.contiguous() 2025-05-07T20:33:07.5561557Z x1 = x1.contiguous() 2025-05-07T20:33:07.5561631Z 2025-05-07T20:33:07.5561719Z if scale_ub is not None: 2025-05-07T20:33:07.5561848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5562029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5562105Z ) 2025-05-07T20:33:07.5562180Z else: 2025-05-07T20:33:07.5562299Z scale_ub_tensor = None 2025-05-07T20:33:07.5562433Z 2025-05-07T20:33:07.5562563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5562657Z op = silu_mul_quant 2025-05-07T20:33:07.5562741Z if compiled: 2025-05-07T20:33:07.5562847Z op = torch.compile(op) 2025-05-07T20:33:07.5562952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5563023Z 2025-05-07T20:33:07.5563114Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5563232Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5563304Z 2025-05-07T20:33:07.5563443Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5563542Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5563639Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5563764Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5563902Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5563981Z 2025-05-07T20:33:07.5564152Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.5564158Z 2025-05-07T20:33:07.5564256Z moe/activation_test.py:126: 2025-05-07T20:33:07.5564385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5564487Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5564666Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5565230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5565327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5565707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5565935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5566353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5566617Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5567107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5567472Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5567850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5568036Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5568431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5568507Z fn() 2025-05-07T20:33:07.5568977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5569069Z self.fn.run( 2025-05-07T20:33:07.5569438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5569559Z kernel = self.compile( 2025-05-07T20:33:07.5569943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5570119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5570250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5570255Z 2025-05-07T20:33:07.5570457Z self = 2025-05-07T20:33:07.5571251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5571765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f891928c5e0>} 2025-05-07T20:33:07.5572575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5572768Z context = 2025-05-07T20:33:07.5572773Z 2025-05-07T20:33:07.5572935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5573201Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5573304Z module_map=module_map) 2025-05-07T20:33:07.5573463Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5573572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.5573644Z E ^ 2025-05-07T20:33:07.5574047Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5574060Z 2025-05-07T20:33:07.5574475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5574517Z 2025-05-07T20:33:07.5574620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5574852Z self=, 2025-05-07T20:33:07.5574928Z T=2048, 2025-05-07T20:33:07.5575002Z D=5120, 2025-05-07T20:33:07.5575086Z scale_ub=None, 2025-05-07T20:33:07.5575171Z contiguous=True, 2025-05-07T20:33:07.5575250Z compiled=True, 2025-05-07T20:33:07.5575323Z ) 2025-05-07T20:33:07.5575539Z self = 2025-05-07T20:33:07.5575717Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.5575725Z 2025-05-07T20:33:07.5575804Z @given( 2025-05-07T20:33:07.5575922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5576021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5576134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5576297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5576416Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5576487Z ) 2025-05-07T20:33:07.5576733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5576829Z def test_silu_mul_quant( 2025-05-07T20:33:07.5576901Z self, 2025-05-07T20:33:07.5576982Z T: int, 2025-05-07T20:33:07.5577055Z D: int, 2025-05-07T20:33:07.5577150Z scale_ub: Optional[float], 2025-05-07T20:33:07.5577241Z contiguous: bool, 2025-05-07T20:33:07.5577323Z compiled: bool, 2025-05-07T20:33:07.5577398Z ) -> None: 2025-05-07T20:33:07.5577504Z torch.manual_seed(2025) 2025-05-07T20:33:07.5577577Z 2025-05-07T20:33:07.5577750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5577826Z 2025-05-07T20:33:07.5577917Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5578038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5578133Z x = x_sign * x_clamp 2025-05-07T20:33:07.5578211Z x0 = x[:, :D] 2025-05-07T20:33:07.5578293Z x1 = x[:, D:] 2025-05-07T20:33:07.5578363Z 2025-05-07T20:33:07.5578449Z if contiguous: 2025-05-07T20:33:07.5578542Z x0 = x0.contiguous() 2025-05-07T20:33:07.5578628Z x1 = x1.contiguous() 2025-05-07T20:33:07.5578699Z 2025-05-07T20:33:07.5578793Z if scale_ub is not None: 2025-05-07T20:33:07.5578896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5579026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5579110Z ) 2025-05-07T20:33:07.5579236Z else: 2025-05-07T20:33:07.5579328Z scale_ub_tensor = None 2025-05-07T20:33:07.5579410Z 2025-05-07T20:33:07.5579536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5579623Z op = silu_mul_quant 2025-05-07T20:33:07.5579708Z if compiled: 
2025-05-07T20:33:07.5579811Z op = torch.compile(op) 2025-05-07T20:33:07.5579919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5579991Z 2025-05-07T20:33:07.5580081Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.5580203Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.5580269Z 2025-05-07T20:33:07.5580401Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5580502Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.5580601Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.5580720Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.5580914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5580991Z 2025-05-07T20:33:07.5581096Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:07.5581100Z 2025-05-07T20:33:07.5581196Z moe/activation_test.py:126: 2025-05-07T20:33:07.5581325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5581477Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.5581609Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.5582183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.5582289Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.5582649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5582875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5583251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.5583512Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5583955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:33:07.5584211Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.5584588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.5584978Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.5590313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.5590416Z fn() 2025-05-07T20:33:07.5590855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.5590951Z self.fn.run( 2025-05-07T20:33:07.5591296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5591398Z kernel = self.compile( 2025-05-07T20:33:07.5591789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5591965Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5592103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5592109Z 2025-05-07T20:33:07.5592312Z self = 2025-05-07T20:33:07.5593104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126) with the same
     CompilationError in _kernel_quantize_fp8_row; test source and traceback identical to the
     T=2048 example above.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn.

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails earlier, at `y_fp8, y_scale = fn()` (moe/activation_test.py:117): the torch.compile'd
     silu_mul_quant (through torch/_dynamo/eval_frame.py:678) reaches
     fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, which launches
     _fbgemm_silu_mul_quant[grid]; its compile (CUDAOptions(..., num_stages=3, ...)) raises:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
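For context on what both failing paths compute: the test checks dequantization as y_fp8.to(torch.float32) * y_scale[:, None], so the quantizer must return a rowwise FP8 tensor plus per-row dequant scales. A rough pure-PyTorch sketch of those rowwise semantics, assuming E4M3 output and illustrative eps/scale_ub handling rather than FBGEMM's actual kernel details:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def rowwise_quantize_fp8_sketch(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
        eps: float = 1e-12,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scaling; the returned scale is the dequant multiplier,
        # matching the test's check y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows (assumed)
        scale = torch.clamp(row_max, min=eps) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Unlike the Triton kernels, this casts through PyTorch's own float8_e4m3fn conversion, so it should run even on GPUs where Triton refuses to lower fp8e4nv.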
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` with the same CompilationError in
     _kernel_quantize_fp8_row.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fails at `y_fp8, y_scale = fn()`: with compiled=False the eager silu_mul_quant
     (activation.py:80) launches _fbgemm_silu_mul_quant[grid] directly, and its compile fails
     with the same CompilationError.

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at `y_fp8, y_scale = fn()` with the same CompilationError in _fbgemm_silu_mul_quant,
     reached through the torch.compile'd path. A minimal repro of this compile-time failure is
     sketched below.
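Both kernels, _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant, fail identically because the ValueError comes from the dtype itself during make_ir, not from either kernel's logic. An untested minimal sketch that should reproduce the same CompilationError on this hardware; _cast_to_fp8e4nv is illustrative only:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(X, Y, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(X + offs)
        # The cast to tl.float8e4nv is what ast_to_ttir rejects on SM < 8.9.
        tl.store(Y + offs, x.to(tl.float8e4nv))


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, BLOCK=16)  # CompilationError on e.g. an A10G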
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5700471Z 2025-05-07T20:33:07.5700885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5700890Z 2025-05-07T20:33:07.5700989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5701209Z self=, 2025-05-07T20:33:07.5701324Z T=128, 2025-05-07T20:33:07.5701398Z D=7168, 2025-05-07T20:33:07.5701482Z scale_ub=1200.0, 2025-05-07T20:33:07.5701565Z contiguous=False, 2025-05-07T20:33:07.5701668Z compiled=False, 2025-05-07T20:33:07.5701745Z ) 2025-05-07T20:33:07.5701990Z self = 2025-05-07T20:33:07.5702158Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5702163Z 2025-05-07T20:33:07.5702239Z @given( 2025-05-07T20:33:07.5702353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5702452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5702568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5702680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5702793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5702863Z ) 2025-05-07T20:33:07.5703149Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5703246Z def test_silu_mul_quant( 2025-05-07T20:33:07.5703320Z self, 2025-05-07T20:33:07.5703391Z T: int, 2025-05-07T20:33:07.5703468Z D: int, 2025-05-07T20:33:07.5703564Z scale_ub: Optional[float], 2025-05-07T20:33:07.5703649Z contiguous: bool, 2025-05-07T20:33:07.5703945Z compiled: bool, 2025-05-07T20:33:07.5704056Z ) -> None: 2025-05-07T20:33:07.5704163Z torch.manual_seed(2025) 2025-05-07T20:33:07.5704235Z 2025-05-07T20:33:07.5704409Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5704494Z 2025-05-07T20:33:07.5704586Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5704707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5704794Z x = x_sign * x_clamp 2025-05-07T20:33:07.5704870Z x0 = x[:, :D] 2025-05-07T20:33:07.5704953Z x1 = x[:, D:] 2025-05-07T20:33:07.5705028Z 2025-05-07T20:33:07.5705105Z if contiguous: 2025-05-07T20:33:07.5705192Z x0 = x0.contiguous() 2025-05-07T20:33:07.5705282Z x1 = x1.contiguous() 2025-05-07T20:33:07.5705353Z 2025-05-07T20:33:07.5705438Z if scale_ub is not None: 2025-05-07T20:33:07.5705551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5705682Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5705757Z ) 2025-05-07T20:33:07.5705833Z else: 2025-05-07T20:33:07.5705923Z scale_ub_tensor = None 2025-05-07T20:33:07.5706082Z 2025-05-07T20:33:07.5706207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5706294Z op = silu_mul_quant 2025-05-07T20:33:07.5706381Z if compiled: 2025-05-07T20:33:07.5706484Z op = torch.compile(op) 2025-05-07T20:33:07.5706590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5706665Z 2025-05-07T20:33:07.5706754Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5706758Z 2025-05-07T20:33:07.5706853Z moe/activation_test.py:117: 2025-05-07T20:33:07.5706985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5707082Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5707189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5707697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5707796Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5708221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5708445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5708788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5708943Z kernel = self.compile( 2025-05-07T20:33:07.5709329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5709509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5709635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5709639Z 2025-05-07T20:33:07.5709917Z self = 2025-05-07T20:33:07.5710710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5711303Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917cbdd30>} 2025-05-07T20:33:07.5712056Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5712250Z context = 2025-05-07T20:33:07.5712255Z 2025-05-07T20:33:07.5712424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5712686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5712794Z module_map=module_map) 2025-05-07T20:33:07.5712961Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5713060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5713135Z E ^ 2025-05-07T20:33:07.5713494Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5713501Z 2025-05-07T20:33:07.5713916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5713920Z 2025-05-07T20:33:07.5714066Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5714323Z self=, 2025-05-07T20:33:07.5714436Z T=128, 2025-05-07T20:33:07.5718668Z D=5120, 2025-05-07T20:33:07.5718772Z scale_ub=None, 2025-05-07T20:33:07.5718861Z contiguous=False, 2025-05-07T20:33:07.5718948Z compiled=False, 2025-05-07T20:33:07.5719021Z ) 2025-05-07T20:33:07.5719323Z self = 2025-05-07T20:33:07.5719501Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5719506Z 2025-05-07T20:33:07.5719581Z @given( 2025-05-07T20:33:07.5719703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5719812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5719927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5720041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5720156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5720230Z ) 2025-05-07T20:33:07.5720481Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5720575Z def test_silu_mul_quant( 2025-05-07T20:33:07.5720651Z self, 2025-05-07T20:33:07.5720728Z T: int, 2025-05-07T20:33:07.5720801Z D: int, 2025-05-07T20:33:07.5720944Z scale_ub: Optional[float], 2025-05-07T20:33:07.5721037Z contiguous: bool, 2025-05-07T20:33:07.5721122Z compiled: bool, 2025-05-07T20:33:07.5721200Z ) -> None: 2025-05-07T20:33:07.5721298Z torch.manual_seed(2025) 2025-05-07T20:33:07.5721370Z 2025-05-07T20:33:07.5721543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5721662Z 2025-05-07T20:33:07.5721753Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5721878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5721963Z x = x_sign * x_clamp 2025-05-07T20:33:07.5722041Z x0 = x[:, :D] 2025-05-07T20:33:07.5722124Z x1 = x[:, D:] 2025-05-07T20:33:07.5722194Z 2025-05-07T20:33:07.5722276Z if contiguous: 2025-05-07T20:33:07.5722371Z x0 = x0.contiguous() 2025-05-07T20:33:07.5722458Z x1 = x1.contiguous() 2025-05-07T20:33:07.5722529Z 2025-05-07T20:33:07.5722623Z if scale_ub is not None: 2025-05-07T20:33:07.5722734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5722871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5722952Z ) 2025-05-07T20:33:07.5723028Z else: 2025-05-07T20:33:07.5723124Z scale_ub_tensor = None 2025-05-07T20:33:07.5723197Z 2025-05-07T20:33:07.5723367Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5723459Z op = silu_mul_quant 2025-05-07T20:33:07.5723542Z if compiled: 2025-05-07T20:33:07.5723642Z op = torch.compile(op) 2025-05-07T20:33:07.5723750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5723824Z 2025-05-07T20:33:07.5723913Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5723917Z 2025-05-07T20:33:07.5724021Z moe/activation_test.py:117: 2025-05-07T20:33:07.5724149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5724258Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5724359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5724863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5724965Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5725331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5725554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5725895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5725987Z kernel = self.compile( 2025-05-07T20:33:07.5726368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5726544Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5726715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5726720Z 2025-05-07T20:33:07.5726928Z self = 2025-05-07T20:33:07.5727712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5728223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917c82310>} 2025-05-07T20:33:07.5728967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5729193Z context = 2025-05-07T20:33:07.5729201Z 2025-05-07T20:33:07.5729371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5729633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5729743Z module_map=module_map) 2025-05-07T20:33:07.5729945Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5730044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5730123Z E ^ 2025-05-07T20:33:07.5730486Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:07.5731008Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and traceback identical to the example above; the launch of _fbgemm_silu_mul_quant fails with the same fp8e4nv CompilationError]
2025-05-07T20:33:07.5743674Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[identical failure; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80]
2025-05-07T20:33:07.5756802Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
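Because hypothesis only varies T, D, scale_ub, contiguous, and compiled, and the error is raised at kernel-compilation time, every sampled combination hits the same CompilationError. A hedged sketch for reproducing one failing example deterministically, without hypothesis; it assumes silu_mul_quant is importable from the module path shown in the traceback:

    from typing import Optional

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 128, 5120  # any sampled pair reproduces the error
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub_tensor: Optional[torch.Tensor] = None
    # On an SM 8.6 GPU this raises the fp8e4nv CompilationError at launch time.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub_tensor)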
2025-05-07T20:33:07.5769787Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:07.5770892Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test source identical to the listing above; this example proceeds past fn() and fails in the reference path instead]
2025-05-07T20:33:07.5775275Z         y_fp8, y_scale = fn()
2025-05-07T20:33:07.5775396Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:07.5775467Z 
2025-05-07T20:33:07.5775601Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:07.5775705Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:07.5775803Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:07.5775922Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:07.5776064Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5776182Z 
2025-05-07T20:33:07.5776283Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.5776288Z 
2025-05-07T20:33:07.5776392Z moe/activation_test.py:126: 
2025-05-07T20:33:07.5776515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:07.5776625Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:07.5776799Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:07.5777359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:07.5777460Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:07.5777818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:07.5778038Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:07.5778410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:07.5778669Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5779069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:33:07.5779364Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:07.5779736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:07.5779906Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:07.5780243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:07.5780321Z     fn()
2025-05-07T20:33:07.5780716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:07.5780802Z     self.fn.run(
2025-05-07T20:33:07.5781138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:07.5781229Z     kernel = self.compile(
2025-05-07T20:33:07.5781609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:07.5781816Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:07.5781965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:07.5781969Z 
2025-05-07T20:33:07.5782175Z self = 
2025-05-07T20:33:07.5782959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:07.5783516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917e44160>}
2025-05-07T20:33:07.5784268Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:07.5784460Z context = 
2025-05-07T20:33:07.5784464Z 
2025-05-07T20:33:07.5784633Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:07.5784895Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:07.5785004Z                            module_map=module_map)
2025-05-07T20:33:07.5785165Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5785265Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.5785386Z E       ^
2025-05-07T20:33:07.5785751Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.5785756Z 
2025-05-07T20:33:07.5786170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.5786219Z 
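This last example is the notable one: here fn() returned without raising, and the failure moved into the reference path, because triton_quantize_fp8_row also emits an fp8e4nv store. The test's correctness check is row-wise: each output row is stored as fp8 plus one float32 scale, and is dequantized as y_fp8.to(torch.float32) * y_scale[:, None]. A hedged pure-PyTorch sketch of that row-wise quantization; the function name, the epsilon, and the treatment of scale_ub as a clamp on the per-row max are illustrative assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max maps to the fp8 max.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # avoid dividing by zero
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Dequantize the way the test does: y ~= y_fp8.to(torch.float32) * y_scale[:, None]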
2025-05-07T20:33:07.5786320Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical fp8e4nv CompilationError from _fbgemm_silu_mul_quant, via the torch.compile path]
2025-05-07T20:33:07.5799504Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5812220Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
2025-05-07T20:33:07.5825040Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[identical failure]
2025-05-07T20:33:07.5842130Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5854560Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[identical failure]
2025-05-07T20:33:07.5866928Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
[identical failure, ending in:]
2025-05-07T20:33:07.5879026Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5879128Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.5879204Z E       ^
2025-05-07T20:33:07.5879563Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5879571Z 2025-05-07T20:33:07.5880021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5880028Z 2025-05-07T20:33:07.5880125Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5880349Z self=, 2025-05-07T20:33:07.5880421Z T=4096, 2025-05-07T20:33:07.5880492Z D=5120, 2025-05-07T20:33:07.5880572Z scale_ub=None, 2025-05-07T20:33:07.5880651Z contiguous=False, 2025-05-07T20:33:07.5880729Z compiled=True, 2025-05-07T20:33:07.5880801Z ) 2025-05-07T20:33:07.5881020Z self = 2025-05-07T20:33:07.5881196Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5881201Z 2025-05-07T20:33:07.5881272Z @given( 2025-05-07T20:33:07.5881388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5881493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5881612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5881724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5881836Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5881906Z ) 2025-05-07T20:33:07.5882151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5882242Z def test_silu_mul_quant( 2025-05-07T20:33:07.5882312Z self, 2025-05-07T20:33:07.5882390Z T: int, 2025-05-07T20:33:07.5882463Z D: int, 2025-05-07T20:33:07.5882558Z scale_ub: Optional[float], 2025-05-07T20:33:07.5882647Z contiguous: bool, 2025-05-07T20:33:07.5882777Z compiled: bool, 2025-05-07T20:33:07.5882851Z ) -> None: 2025-05-07T20:33:07.5882943Z torch.manual_seed(2025) 2025-05-07T20:33:07.5883013Z 2025-05-07T20:33:07.5883179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5883253Z 2025-05-07T20:33:07.5883345Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5883464Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5883550Z x = x_sign * x_clamp 2025-05-07T20:33:07.5883626Z x0 = x[:, :D] 2025-05-07T20:33:07.5883704Z x1 = x[:, D:] 2025-05-07T20:33:07.5883773Z 2025-05-07T20:33:07.5883851Z if contiguous: 2025-05-07T20:33:07.5883940Z x0 = x0.contiguous() 2025-05-07T20:33:07.5884025Z x1 = x1.contiguous() 2025-05-07T20:33:07.5884095Z 2025-05-07T20:33:07.5884185Z if scale_ub is not None: 2025-05-07T20:33:07.5884286Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5884461Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5884538Z ) 2025-05-07T20:33:07.5884612Z else: 2025-05-07T20:33:07.5884702Z scale_ub_tensor = None 2025-05-07T20:33:07.5884778Z 2025-05-07T20:33:07.5884903Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5885037Z op = silu_mul_quant 2025-05-07T20:33:07.5885124Z if compiled: 2025-05-07T20:33:07.5885219Z op = torch.compile(op) 2025-05-07T20:33:07.5885321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5885389Z 2025-05-07T20:33:07.5885475Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5885480Z 2025-05-07T20:33:07.5885576Z moe/activation_test.py:117: 2025-05-07T20:33:07.5885698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5885795Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5885895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5886275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5886369Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5886902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5887003Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5887360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5887583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5887920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5888019Z kernel = self.compile( 2025-05-07T20:33:07.5888394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5888573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5888695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5888699Z 2025-05-07T20:33:07.5888899Z self = 2025-05-07T20:33:07.5889691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5890194Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917d7e940>} 2025-05-07T20:33:07.5890954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5891205Z context = 2025-05-07T20:33:07.5891210Z 2025-05-07T20:33:07.5891373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5891639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5891743Z module_map=module_map) 2025-05-07T20:33:07.5891903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5892024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5892125Z E ^ 2025-05-07T20:33:07.5892559Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5892565Z 2025-05-07T20:33:07.5892978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5892983Z 2025-05-07T20:33:07.5893138Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5893358Z self=, 2025-05-07T20:33:07.5893432Z T=4096, 2025-05-07T20:33:07.5893503Z D=5120, 2025-05-07T20:33:07.5893580Z scale_ub=1200.0, 2025-05-07T20:33:07.5893662Z contiguous=False, 2025-05-07T20:33:07.5893790Z compiled=False, 2025-05-07T20:33:07.5893862Z ) 2025-05-07T20:33:07.5894074Z self = 2025-05-07T20:33:07.5894249Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5894254Z 2025-05-07T20:33:07.5894325Z @given( 2025-05-07T20:33:07.5894443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5894538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5894649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5894767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5894885Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5894959Z ) 2025-05-07T20:33:07.5895206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5895293Z def test_silu_mul_quant( 2025-05-07T20:33:07.5895362Z self, 2025-05-07T20:33:07.5895484Z T: int, 2025-05-07T20:33:07.5895558Z D: int, 2025-05-07T20:33:07.5895654Z scale_ub: Optional[float], 2025-05-07T20:33:07.5895740Z contiguous: bool, 2025-05-07T20:33:07.5895819Z compiled: bool, 2025-05-07T20:33:07.5895896Z ) -> None: 2025-05-07T20:33:07.5895986Z torch.manual_seed(2025) 2025-05-07T20:33:07.5896055Z 2025-05-07T20:33:07.5896221Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5896291Z 2025-05-07T20:33:07.5896378Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5896499Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5896592Z x = x_sign * x_clamp 2025-05-07T20:33:07.5896670Z x0 = x[:, :D] 2025-05-07T20:33:07.5896750Z x1 = x[:, D:] 2025-05-07T20:33:07.5896819Z 2025-05-07T20:33:07.5896901Z if contiguous: 2025-05-07T20:33:07.5896989Z x0 = x0.contiguous() 2025-05-07T20:33:07.5897073Z x1 = x1.contiguous() 2025-05-07T20:33:07.5897155Z 2025-05-07T20:33:07.5897243Z if scale_ub is not None: 2025-05-07T20:33:07.5897343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5897478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5897552Z ) 2025-05-07T20:33:07.5897632Z else: 2025-05-07T20:33:07.5897727Z scale_ub_tensor = None 2025-05-07T20:33:07.5897796Z 2025-05-07T20:33:07.5897924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5898010Z op = silu_mul_quant 2025-05-07T20:33:07.5898090Z if compiled: 2025-05-07T20:33:07.5898240Z op = torch.compile(op) 2025-05-07T20:33:07.5898342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5898416Z 2025-05-07T20:33:07.5898506Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5898511Z 2025-05-07T20:33:07.5898603Z moe/activation_test.py:117: 2025-05-07T20:33:07.5898734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5898832Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5898927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5899432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5899527Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5899883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5900112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5900494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5900596Z kernel = self.compile( 2025-05-07T20:33:07.5900974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5901188Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5901313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5901318Z 2025-05-07T20:33:07.5901519Z self = 2025-05-07T20:33:07.5902304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5902811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917bae3a0>} 2025-05-07T20:33:07.5903595Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5904003Z context = 2025-05-07T20:33:07.5904010Z 2025-05-07T20:33:07.5904176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5904438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5904540Z module_map=module_map) 2025-05-07T20:33:07.5904696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5904794Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5904867Z E ^ 2025-05-07T20:33:07.5905230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5905241Z 2025-05-07T20:33:07.5905653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5905661Z 2025-05-07T20:33:07.5905762Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5905985Z self=, 2025-05-07T20:33:07.5906059Z T=4096, 2025-05-07T20:33:07.5906127Z D=5120, 2025-05-07T20:33:07.5906217Z scale_ub=1200.0, 2025-05-07T20:33:07.5906299Z contiguous=False, 2025-05-07T20:33:07.5906379Z compiled=True, 2025-05-07T20:33:07.5906453Z ) 2025-05-07T20:33:07.5906669Z self = 2025-05-07T20:33:07.5906844Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.5906849Z 2025-05-07T20:33:07.5907022Z @given( 2025-05-07T20:33:07.5907137Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5907235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5907347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5907460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5907579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5907649Z ) 2025-05-07T20:33:07.5907893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5907980Z def test_silu_mul_quant( 2025-05-07T20:33:07.5908052Z self, 2025-05-07T20:33:07.5908128Z T: int, 2025-05-07T20:33:07.5908199Z D: int, 2025-05-07T20:33:07.5908292Z scale_ub: Optional[float], 2025-05-07T20:33:07.5908380Z contiguous: bool, 2025-05-07T20:33:07.5908461Z compiled: bool, 2025-05-07T20:33:07.5908531Z ) -> None: 2025-05-07T20:33:07.5908695Z torch.manual_seed(2025) 2025-05-07T20:33:07.5908770Z 2025-05-07T20:33:07.5908934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5909009Z 2025-05-07T20:33:07.5909097Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5909215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5909368Z x = x_sign * x_clamp 2025-05-07T20:33:07.5909449Z x0 = x[:, :D] 2025-05-07T20:33:07.5909527Z x1 = x[:, D:] 2025-05-07T20:33:07.5909596Z 2025-05-07T20:33:07.5909673Z if contiguous: 2025-05-07T20:33:07.5909771Z x0 = x0.contiguous() 2025-05-07T20:33:07.5909920Z x1 = x1.contiguous() 2025-05-07T20:33:07.5909988Z 2025-05-07T20:33:07.5910080Z if scale_ub is not None: 2025-05-07T20:33:07.5910183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5910312Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5910387Z ) 2025-05-07T20:33:07.5910465Z else: 2025-05-07T20:33:07.5910553Z scale_ub_tensor = None 2025-05-07T20:33:07.5910626Z 2025-05-07T20:33:07.5910752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5910839Z op = silu_mul_quant 2025-05-07T20:33:07.5910918Z if compiled: 2025-05-07T20:33:07.5911084Z op = torch.compile(op) 2025-05-07T20:33:07.5911190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5911259Z 2025-05-07T20:33:07.5911346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5911350Z 2025-05-07T20:33:07.5911447Z moe/activation_test.py:117: 2025-05-07T20:33:07.5911570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5911663Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5911762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5912125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5912223Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5912710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5912805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5913159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5913383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5913714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5913808Z kernel = self.compile( 2025-05-07T20:33:07.5914180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5914353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5914477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5914526Z 2025-05-07T20:33:07.5914727Z self = 2025-05-07T20:33:07.5915511Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5916014Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917bae280>} 2025-05-07T20:33:07.5916762Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5916948Z context = 2025-05-07T20:33:07.5916996Z 2025-05-07T20:33:07.5917161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5917417Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5917520Z module_map=module_map) 2025-05-07T20:33:07.5917721Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5917817Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5917892Z E ^ 2025-05-07T20:33:07.5918255Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5918260Z 2025-05-07T20:33:07.5918669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5918674Z 2025-05-07T20:33:07.5918776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5918998Z self=, 2025-05-07T20:33:07.5919074Z T=2048, 2025-05-07T20:33:07.5919150Z D=7168, 2025-05-07T20:33:07.5919230Z scale_ub=1200.0, 2025-05-07T20:33:07.5919310Z contiguous=False, 2025-05-07T20:33:07.5919393Z compiled=False, 2025-05-07T20:33:07.5919462Z ) 2025-05-07T20:33:07.5919740Z self = 2025-05-07T20:33:07.5919919Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.5919924Z 2025-05-07T20:33:07.5919994Z @given( 2025-05-07T20:33:07.5920114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5920207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5920318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5920433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5920542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5920613Z ) 2025-05-07T20:33:07.5920863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5920951Z def test_silu_mul_quant( 2025-05-07T20:33:07.5921027Z self, 2025-05-07T20:33:07.5921100Z T: int, 2025-05-07T20:33:07.5921169Z D: int, 2025-05-07T20:33:07.5921267Z scale_ub: Optional[float], 2025-05-07T20:33:07.5921355Z contiguous: bool, 2025-05-07T20:33:07.5921436Z compiled: bool, 2025-05-07T20:33:07.5921511Z ) -> None: 2025-05-07T20:33:07.5921601Z torch.manual_seed(2025) 2025-05-07T20:33:07.5921670Z 2025-05-07T20:33:07.5921839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5921910Z 2025-05-07T20:33:07.5921998Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5922118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5922204Z x = x_sign * x_clamp 2025-05-07T20:33:07.5922281Z x0 = x[:, :D] 2025-05-07T20:33:07.5922361Z x1 = x[:, D:] 2025-05-07T20:33:07.5922489Z 2025-05-07T20:33:07.5922571Z if contiguous: 2025-05-07T20:33:07.5922656Z x0 = x0.contiguous() 2025-05-07T20:33:07.5922741Z x1 = x1.contiguous() 2025-05-07T20:33:07.5922816Z 2025-05-07T20:33:07.5922901Z if scale_ub is not None: 2025-05-07T20:33:07.5923007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5923138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5923210Z ) 2025-05-07T20:33:07.5923280Z else: 2025-05-07T20:33:07.5923374Z scale_ub_tensor = None 2025-05-07T20:33:07.5923440Z 2025-05-07T20:33:07.5923565Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5923656Z op = silu_mul_quant 2025-05-07T20:33:07.5923737Z if compiled: 2025-05-07T20:33:07.5923836Z op = torch.compile(op) 2025-05-07T20:33:07.5923938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5924048Z 2025-05-07T20:33:07.5924139Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5924143Z 2025-05-07T20:33:07.5924235Z moe/activation_test.py:117: 2025-05-07T20:33:07.5924358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5924457Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5924661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5925322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5925447Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5925942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5926249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5926680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5926783Z kernel = self.compile( 2025-05-07T20:33:07.5927169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5927342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5927537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5927546Z 2025-05-07T20:33:07.5927751Z self = 2025-05-07T20:33:07.5928530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5929043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917b7d670>} 2025-05-07T20:33:07.5929794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5929989Z context = 2025-05-07T20:33:07.5929998Z 2025-05-07T20:33:07.5930160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5930421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5930529Z module_map=module_map) 2025-05-07T20:33:07.5930690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5930787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5930861Z E ^ 2025-05-07T20:33:07.5931222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5931275Z 2025-05-07T20:33:07.5931691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5931696Z 2025-05-07T20:33:07.5931801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5932029Z self=, 2025-05-07T20:33:07.5932114Z T=1, 2025-05-07T20:33:07.5932189Z D=7168, 2025-05-07T20:33:07.5932276Z scale_ub=None, 2025-05-07T20:33:07.5932360Z contiguous=True, 2025-05-07T20:33:07.5932441Z compiled=False, 2025-05-07T20:33:07.5932516Z ) 2025-05-07T20:33:07.5932730Z self = 2025-05-07T20:33:07.5932892Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.5932897Z 2025-05-07T20:33:07.5932974Z @given( 2025-05-07T20:33:07.5933091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5933236Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5933353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5933468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5933586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5933659Z ) 2025-05-07T20:33:07.5933944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5934043Z def test_silu_mul_quant( 2025-05-07T20:33:07.5934119Z self, 2025-05-07T20:33:07.5934193Z T: int, 2025-05-07T20:33:07.5934272Z D: int, 2025-05-07T20:33:07.5934370Z scale_ub: Optional[float], 2025-05-07T20:33:07.5934457Z contiguous: bool, 2025-05-07T20:33:07.5934547Z compiled: bool, 2025-05-07T20:33:07.5934620Z ) -> None: 2025-05-07T20:33:07.5934716Z torch.manual_seed(2025) 2025-05-07T20:33:07.5934788Z 2025-05-07T20:33:07.5934951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5935034Z 2025-05-07T20:33:07.5935120Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5935238Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5935326Z x = x_sign * x_clamp 2025-05-07T20:33:07.5935402Z x0 = x[:, :D] 2025-05-07T20:33:07.5935478Z x1 = x[:, D:] 2025-05-07T20:33:07.5935596Z 2025-05-07T20:33:07.5935681Z if contiguous: 2025-05-07T20:33:07.5935768Z x0 = x0.contiguous() 2025-05-07T20:33:07.5935853Z x1 = x1.contiguous() 2025-05-07T20:33:07.5935924Z 2025-05-07T20:33:07.5936011Z if scale_ub is not None: 2025-05-07T20:33:07.5936116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5936247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5936326Z ) 2025-05-07T20:33:07.5936401Z else: 2025-05-07T20:33:07.5936491Z scale_ub_tensor = None 2025-05-07T20:33:07.5936565Z 2025-05-07T20:33:07.5936698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5936783Z op = silu_mul_quant 2025-05-07T20:33:07.5936866Z if compiled: 2025-05-07T20:33:07.5936960Z op = torch.compile(op) 2025-05-07T20:33:07.5937061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5937136Z 2025-05-07T20:33:07.5937224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5937229Z 2025-05-07T20:33:07.5937326Z moe/activation_test.py:117: 2025-05-07T20:33:07.5937452Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5937549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5937649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5938206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5938298Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5938656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5938928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5939266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5939362Z kernel = self.compile( 2025-05-07T20:33:07.5939738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5939912Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5940033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5940037Z 2025-05-07T20:33:07.5940239Z self = 2025-05-07T20:33:07.5941057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5941565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89177e0280>} 2025-05-07T20:33:07.5942354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5942539Z context = 2025-05-07T20:33:07.5942544Z 2025-05-07T20:33:07.5942709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5942967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5943072Z module_map=module_map) 2025-05-07T20:33:07.5943240Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5943336Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5943406Z E ^ 2025-05-07T20:33:07.5943762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5943770Z 2025-05-07T20:33:07.5944224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5944229Z 2025-05-07T20:33:07.5944332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5944549Z self=, 2025-05-07T20:33:07.5944622Z T=16384, 2025-05-07T20:33:07.5944700Z D=7168, 2025-05-07T20:33:07.5944779Z scale_ub=1200.0, 2025-05-07T20:33:07.5944861Z contiguous=False, 2025-05-07T20:33:07.5944944Z compiled=True, 2025-05-07T20:33:07.5945013Z ) 2025-05-07T20:33:07.5945232Z self = 2025-05-07T20:33:07.5945408Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.5945413Z 2025-05-07T20:33:07.5945486Z @given( 2025-05-07T20:33:07.5945603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5945705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5945814Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5945930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5946037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5946106Z ) 2025-05-07T20:33:07.5946351Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5946441Z def test_silu_mul_quant( 2025-05-07T20:33:07.5946514Z self, 2025-05-07T20:33:07.5946586Z T: int, 2025-05-07T20:33:07.5946655Z D: int, 2025-05-07T20:33:07.5946755Z scale_ub: Optional[float], 2025-05-07T20:33:07.5946886Z contiguous: bool, 2025-05-07T20:33:07.5946968Z compiled: bool, 2025-05-07T20:33:07.5947041Z ) -> None: 2025-05-07T20:33:07.5947133Z torch.manual_seed(2025) 2025-05-07T20:33:07.5947202Z 2025-05-07T20:33:07.5947371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5947447Z 2025-05-07T20:33:07.5947533Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5947654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5947744Z x = x_sign * x_clamp 2025-05-07T20:33:07.5947821Z x0 = x[:, :D] 2025-05-07T20:33:07.5947897Z x1 = x[:, D:] 2025-05-07T20:33:07.5947964Z 2025-05-07T20:33:07.5948049Z if contiguous: 2025-05-07T20:33:07.5948135Z x0 = x0.contiguous() 2025-05-07T20:33:07.5948221Z x1 = x1.contiguous() 2025-05-07T20:33:07.5948295Z 2025-05-07T20:33:07.5948381Z if scale_ub is not None: 2025-05-07T20:33:07.5948529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5948667Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5948738Z ) 2025-05-07T20:33:07.5948812Z else: 2025-05-07T20:33:07.5948908Z scale_ub_tensor = None 2025-05-07T20:33:07.5948976Z 2025-05-07T20:33:07.5949181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5949271Z op = silu_mul_quant 2025-05-07T20:33:07.5949351Z if compiled: 2025-05-07T20:33:07.5949450Z op = torch.compile(op) 2025-05-07T20:33:07.5949551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5949622Z 2025-05-07T20:33:07.5949712Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5949716Z 2025-05-07T20:33:07.5949896Z moe/activation_test.py:117: 2025-05-07T20:33:07.5950027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5950126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5950227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5950599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5950685Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5951219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5951324Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5951678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5951900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5952237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5952326Z kernel = self.compile( 2025-05-07T20:33:07.5952709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5952881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5953002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5953006Z 2025-05-07T20:33:07.5953216Z self = 2025-05-07T20:33:07.5954001Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5954509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89177e0ee0>} 2025-05-07T20:33:07.5955254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5955485Z context = 2025-05-07T20:33:07.5955493Z 2025-05-07T20:33:07.5955652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5955916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5956020Z module_map=module_map) 2025-05-07T20:33:07.5956179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5956275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5956353Z E ^ 2025-05-07T20:33:07.5956713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5956718Z 2025-05-07T20:33:07.5957165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5957173Z 2025-05-07T20:33:07.5957271Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5957490Z self=, 2025-05-07T20:33:07.5957569Z T=1, 2025-05-07T20:33:07.5957643Z D=7168, 2025-05-07T20:33:07.5957763Z scale_ub=None, 2025-05-07T20:33:07.5957853Z contiguous=False, 2025-05-07T20:33:07.5957937Z compiled=False, 2025-05-07T20:33:07.5958006Z ) 2025-05-07T20:33:07.5958221Z self = 2025-05-07T20:33:07.5958385Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:07.5958390Z 2025-05-07T20:33:07.5958473Z @given( 2025-05-07T20:33:07.5958589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5958686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5958800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5963456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5963592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5963666Z ) 2025-05-07T20:33:07.5963918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5964012Z def test_silu_mul_quant( 2025-05-07T20:33:07.5964155Z self, 2025-05-07T20:33:07.5964232Z T: int, 2025-05-07T20:33:07.5964304Z D: int, 2025-05-07T20:33:07.5964398Z scale_ub: Optional[float], 2025-05-07T20:33:07.5964483Z contiguous: bool, 2025-05-07T20:33:07.5964566Z compiled: bool, 2025-05-07T20:33:07.5964647Z ) -> None: 2025-05-07T20:33:07.5964737Z torch.manual_seed(2025) 2025-05-07T20:33:07.5964811Z 2025-05-07T20:33:07.5964985Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5965057Z 2025-05-07T20:33:07.5965152Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5965277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5965364Z x = x_sign * x_clamp 2025-05-07T20:33:07.5965443Z x0 = x[:, :D] 2025-05-07T20:33:07.5965517Z x1 = x[:, D:] 2025-05-07T20:33:07.5965589Z 2025-05-07T20:33:07.5965673Z if contiguous: 2025-05-07T20:33:07.5965760Z x0 = x0.contiguous() 2025-05-07T20:33:07.5965852Z x1 = x1.contiguous() 2025-05-07T20:33:07.5965923Z 2025-05-07T20:33:07.5966010Z if scale_ub is not None: 2025-05-07T20:33:07.5966111Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5966245Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5966318Z ) 2025-05-07T20:33:07.5966392Z else: 2025-05-07T20:33:07.5966484Z scale_ub_tensor = None 2025-05-07T20:33:07.5966552Z 2025-05-07T20:33:07.5966683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5966771Z op = silu_mul_quant 2025-05-07T20:33:07.5966904Z if compiled: 2025-05-07T20:33:07.5967007Z op = torch.compile(op) 2025-05-07T20:33:07.5967112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5967181Z 2025-05-07T20:33:07.5967270Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5967275Z 2025-05-07T20:33:07.5967376Z moe/activation_test.py:117: 2025-05-07T20:33:07.5967508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5967607Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5967701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5968207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5968298Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5968651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5968920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5969261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5969350Z kernel = self.compile( 2025-05-07T20:33:07.5969736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5969948Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5970074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5970079Z 2025-05-07T20:33:07.5970282Z self = 2025-05-07T20:33:07.5971063Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5971576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917abd670>} 2025-05-07T20:33:07.5972355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5972551Z context = 2025-05-07T20:33:07.5972556Z 2025-05-07T20:33:07.5972716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5972978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5973081Z module_map=module_map) 2025-05-07T20:33:07.5973239Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5973334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5973414Z E ^ 2025-05-07T20:33:07.5973767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5973772Z 2025-05-07T20:33:07.5974183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5974193Z 2025-05-07T20:33:07.5974290Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5974511Z self=, 2025-05-07T20:33:07.5974584Z T=2048, 2025-05-07T20:33:07.5974653Z D=7168, 2025-05-07T20:33:07.5974735Z scale_ub=None, 2025-05-07T20:33:07.5974816Z contiguous=False, 2025-05-07T20:33:07.5974893Z compiled=True, 2025-05-07T20:33:07.5974967Z ) 2025-05-07T20:33:07.5975179Z self = 2025-05-07T20:33:07.5975352Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5975406Z 2025-05-07T20:33:07.5975478Z @given( 2025-05-07T20:33:07.5975595Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5975695Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5975807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5975927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5976042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5976114Z ) 2025-05-07T20:33:07.5976356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5976445Z def test_silu_mul_quant( 2025-05-07T20:33:07.5976519Z self, 2025-05-07T20:33:07.5976594Z T: int, 2025-05-07T20:33:07.5976664Z D: int, 2025-05-07T20:33:07.5976761Z scale_ub: Optional[float], 2025-05-07T20:33:07.5976849Z contiguous: bool, 2025-05-07T20:33:07.5976929Z compiled: bool, 2025-05-07T20:33:07.5977004Z ) -> None: 2025-05-07T20:33:07.5977142Z torch.manual_seed(2025) 2025-05-07T20:33:07.5977213Z 2025-05-07T20:33:07.5977378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5977451Z 2025-05-07T20:33:07.5977539Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5977661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5977791Z x = x_sign * x_clamp 2025-05-07T20:33:07.5977867Z x0 = x[:, :D] 2025-05-07T20:33:07.5977941Z x1 = x[:, D:] 2025-05-07T20:33:07.5978017Z 2025-05-07T20:33:07.5978096Z if contiguous: 2025-05-07T20:33:07.5978190Z x0 = x0.contiguous() 2025-05-07T20:33:07.5978275Z x1 = x1.contiguous() 2025-05-07T20:33:07.5978346Z 2025-05-07T20:33:07.5978437Z if scale_ub is not None: 2025-05-07T20:33:07.5978537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5978666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5978744Z ) 2025-05-07T20:33:07.5978817Z else: 2025-05-07T20:33:07.5978905Z scale_ub_tensor = None 2025-05-07T20:33:07.5978976Z 2025-05-07T20:33:07.5979103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5979187Z op = silu_mul_quant 2025-05-07T20:33:07.5979316Z if compiled: 2025-05-07T20:33:07.5979415Z op = torch.compile(op) 2025-05-07T20:33:07.5979521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5979587Z 2025-05-07T20:33:07.5979673Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5979678Z 2025-05-07T20:33:07.5979776Z moe/activation_test.py:117: 2025-05-07T20:33:07.5979900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5979995Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5980092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5980458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5980554Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5981050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5981143Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5981504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5981725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5982059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5982149Z kernel = self.compile( 2025-05-07T20:33:07.5982526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5982702Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5982871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5982875Z 2025-05-07T20:33:07.5983078Z self = 2025-05-07T20:33:07.5983870Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5984377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917658550>} 2025-05-07T20:33:07.5985130Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5985387Z context = 2025-05-07T20:33:07.5985395Z 2025-05-07T20:33:07.5985559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5985822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5985928Z module_map=module_map) 2025-05-07T20:33:07.5986131Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5986228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5986304Z E ^ 2025-05-07T20:33:07.5986660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5986665Z 2025-05-07T20:33:07.5987073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.5987078Z 2025-05-07T20:33:07.5987180Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.5987401Z self=, 2025-05-07T20:33:07.5987474Z T=4096, 2025-05-07T20:33:07.5987552Z D=7168, 2025-05-07T20:33:07.5987629Z scale_ub=None, 2025-05-07T20:33:07.5987713Z contiguous=False, 2025-05-07T20:33:07.5987794Z compiled=True, 2025-05-07T20:33:07.5987861Z ) 2025-05-07T20:33:07.5988121Z self = 2025-05-07T20:33:07.5988298Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.5988303Z 2025-05-07T20:33:07.5988375Z @given( 2025-05-07T20:33:07.5988493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.5988587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.5988697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.5988814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.5988922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.5988998Z ) 2025-05-07T20:33:07.5989243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.5989332Z def test_silu_mul_quant( 2025-05-07T20:33:07.5989404Z self, 2025-05-07T20:33:07.5989482Z T: int, 2025-05-07T20:33:07.5989555Z D: int, 2025-05-07T20:33:07.5989653Z scale_ub: Optional[float], 2025-05-07T20:33:07.5989741Z contiguous: bool, 2025-05-07T20:33:07.5989910Z compiled: bool, 2025-05-07T20:33:07.5990003Z ) -> None: 2025-05-07T20:33:07.5990093Z torch.manual_seed(2025) 2025-05-07T20:33:07.5990161Z 2025-05-07T20:33:07.5990337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.5990409Z 2025-05-07T20:33:07.5990496Z x_sign = torch.sign(x) 2025-05-07T20:33:07.5990617Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.5990700Z x = x_sign * x_clamp 2025-05-07T20:33:07.5990778Z x0 = x[:, :D] 2025-05-07T20:33:07.5990913Z x1 = x[:, D:] 2025-05-07T20:33:07.5990981Z 2025-05-07T20:33:07.5991061Z if contiguous: 2025-05-07T20:33:07.5991150Z x0 = x0.contiguous() 2025-05-07T20:33:07.5991240Z x1 = x1.contiguous() 2025-05-07T20:33:07.5991315Z 2025-05-07T20:33:07.5991404Z if scale_ub is not None: 2025-05-07T20:33:07.5991511Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.5991644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.5991716Z ) 2025-05-07T20:33:07.5991790Z else: 2025-05-07T20:33:07.5991884Z scale_ub_tensor = None 2025-05-07T20:33:07.5991955Z 2025-05-07T20:33:07.5992079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.5992167Z op = silu_mul_quant 2025-05-07T20:33:07.5992247Z if compiled: 2025-05-07T20:33:07.5992341Z op = torch.compile(op) 2025-05-07T20:33:07.5992495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5992568Z 2025-05-07T20:33:07.5992660Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.5992664Z 2025-05-07T20:33:07.5992755Z moe/activation_test.py:117: 2025-05-07T20:33:07.5992921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5993120Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.5993242Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.5993620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.5993712Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.5994203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.5994309Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.5994666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.5994899Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.5995240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.5995330Z kernel = self.compile( 2025-05-07T20:33:07.5995764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.5995942Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.5996065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.5996070Z 2025-05-07T20:33:07.5996276Z self = 2025-05-07T20:33:07.5997067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.5997578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f891777b160>} 2025-05-07T20:33:07.5998338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.5998528Z context = 2025-05-07T20:33:07.5998533Z 2025-05-07T20:33:07.5998700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.5998957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.5999066Z module_map=module_map) 2025-05-07T20:33:07.5999223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.5999366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.5999445Z E ^ 2025-05-07T20:33:07.5999798Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.5999802Z 2025-05-07T20:33:07.6000216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.6000223Z 2025-05-07T20:33:07.6000325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6000541Z self=, 2025-05-07T20:33:07.6000618Z T=16384, 2025-05-07T20:33:07.6000700Z D=5120, 2025-05-07T20:33:07.6000784Z scale_ub=1200.0, 2025-05-07T20:33:07.6000866Z contiguous=False, 2025-05-07T20:33:07.6000949Z compiled=False, 2025-05-07T20:33:07.6001018Z ) 2025-05-07T20:33:07.6001239Z self = 2025-05-07T20:33:07.6001468Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:07.6001477Z 2025-05-07T20:33:07.6001549Z @given( 2025-05-07T20:33:07.6001669Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6001762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6001880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6002038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6002149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6002223Z ) 2025-05-07T20:33:07.6002468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6002564Z def test_silu_mul_quant( 2025-05-07T20:33:07.6002651Z self, 2025-05-07T20:33:07.6002725Z T: int, 2025-05-07T20:33:07.6002805Z D: int, 2025-05-07T20:33:07.6002905Z scale_ub: Optional[float], 2025-05-07T20:33:07.6002991Z contiguous: bool, 2025-05-07T20:33:07.6003074Z compiled: bool, 2025-05-07T20:33:07.6003163Z ) -> None: 2025-05-07T20:33:07.6003253Z torch.manual_seed(2025) 2025-05-07T20:33:07.6003322Z 2025-05-07T20:33:07.6003485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6003556Z 2025-05-07T20:33:07.6003648Z x_sign = torch.sign(x) 2025-05-07T20:33:07.6004156Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.6004279Z x = x_sign * x_clamp 2025-05-07T20:33:07.6004385Z x0 = x[:, :D] 2025-05-07T20:33:07.6004489Z x1 = x[:, D:] 2025-05-07T20:33:07.6004583Z 2025-05-07T20:33:07.6004669Z if contiguous: 2025-05-07T20:33:07.6004754Z x0 = x0.contiguous() 2025-05-07T20:33:07.6004836Z x1 = x1.contiguous() 2025-05-07T20:33:07.6004907Z 2025-05-07T20:33:07.6004995Z if scale_ub is not None: 2025-05-07T20:33:07.6005101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.6005236Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.6005312Z ) 2025-05-07T20:33:07.6005390Z else: 2025-05-07T20:33:07.6005479Z scale_ub_tensor = None 2025-05-07T20:33:07.6005544Z 2025-05-07T20:33:07.6005673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.6005762Z op = silu_mul_quant 2025-05-07T20:33:07.6005841Z if compiled: 2025-05-07T20:33:07.6005937Z op = torch.compile(op) 2025-05-07T20:33:07.6006037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6006108Z 2025-05-07T20:33:07.6006200Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.6006204Z 2025-05-07T20:33:07.6006298Z moe/activation_test.py:117: 2025-05-07T20:33:07.6006426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6006523Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.6006614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.6007195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
Hypothesis went on to try ten more examples, and every one failed with the same CompilationError from the same _fbgemm_silu_mul_quant launch; the source listing and traceback repeat verbatim for each (with an extra torch/_dynamo/eval_frame.py:678 frame in _fn whenever compiled=True):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
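The failure is independent of FBGEMM and of Hypothesis: any Triton kernel that materializes an fp8e4nv value trips the same architecture check during ast_to_ttir. A standalone repro sketch, assuming a CUDA build of PyTorch with float8 dtypes and a recent Triton; the kernel, names, and sizes are made up for illustration:

    # Hypothetical minimal repro: casting to tl.float8e4nv inside any
    # @triton.jit kernel raises the same CompilationError on SM < 8.9.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # This cast is what make_ir rejects on unsupported architectures.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    _cast_fp8[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)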
2025-05-07T20:33:07.6141370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.6141471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.6141838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.6142100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.6142502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.6142597Z kernel = self.compile( 2025-05-07T20:33:07.6142975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.6143154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.6143277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.6143281Z 2025-05-07T20:33:07.6143483Z self = 2025-05-07T20:33:07.6144274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.6144821Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8917394dc0>} 2025-05-07T20:33:07.6145571Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.6145766Z context = 2025-05-07T20:33:07.6145771Z 2025-05-07T20:33:07.6145936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.6146203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.6146313Z module_map=module_map) 2025-05-07T20:33:07.6146478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.6146580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.6146655Z E ^ 2025-05-07T20:33:07.6147014Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tries further examples, each failing in one of the same two ways:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError at moe/activation_test.py:117 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
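Every CompilationError in this run is the same architecture mismatch rather than a problem with the test inputs: Triton only accepts the fp8e4nv (FP8 E4M3) element type on GPUs of compute capability 8.9 or newer, and the 22.07 GiB device reported in the OOM messages is consistent with an older sm_86-class part (e.g. an A10G), where only the fp8e4b15 and fp8e5 encodings named in the error are available. A minimal sketch of a guard that would skip such cases up front (the helper and test names are illustrative, not from the test file):

    import unittest

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv requires compute capability
        # >= 8.9 (Ada/Hopper); sm_86 and older raise the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
    def test_silu_mul_quant_fp8_guarded() -> None:
        ...  # body unchanged; runs only where the Triton kernel can compile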
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): CUDA out of memory. Tried to allocate 112.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): CUDA out of memory. Tried to allocate 448.00 MiB; 140.44 MiB of 22.07 GiB free; process using 21.92 GiB (21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): CUDA out of memory. Tried to allocate 56.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): CUDA out of memory. Tried to allocate 56.00 MiB; 28.44 MiB of 22.07 GiB free; process using 22.03 GiB (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated).
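The OutOfMemoryError examples are a separate failure mode: each example allocates a fresh [T, 2*D] bfloat16 input plus same-sized temporaries for torch.sign, torch.abs and torch.clamp, and with many hypothesis examples executed in one process the 22.07 GiB device fills up until even a 56 MiB request fails. A minimal sketch of releasing cached blocks between examples, assuming the pressure comes from accumulated allocator cache rather than live references (the helper name is illustrative):

    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references to old tensors
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver

    # e.g. call release_cuda_memory() at the top of the test body so every
    # hypothesis example starts from a cleaner allocator state.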
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 56.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated).
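Note that the compiled flag is irrelevant to the CompilationError: the T=1 and T=128 examples above fail with compiled=False because silu_mul_quant launches the _fbgemm_silu_mul_quant Triton kernel directly (activation.py:80), so the kernel is JIT-compiled on first call either way; torch.compile only changes how the Python wrapper is traced. To replay one failing case without rerunning the whole search, hypothesis can pin it with @example; a sketch with a stand-in body, using a parameter value copied from the log:

    from hypothesis import example, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(deadline=None)
    @example(T=1)  # known-failing case from this log, replayed first
    def test_silu_mul_quant_repro(T: int) -> None:
        assert T >= 1  # stand-in; the real test would call silu_mul_quant here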
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 320.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 80.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 112.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn): CUDA out of memory. Tried to allocate 40.00 MiB; 26.44 MiB of 22.07 GiB free; process using 22.04 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6292046Z 2025-05-07T20:33:07.6292163Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6292168Z 2025-05-07T20:33:07.6292269Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6292492Z self=, 2025-05-07T20:33:07.6292605Z T=4096, 2025-05-07T20:33:07.6292681Z D=7168, 2025-05-07T20:33:07.6292759Z scale_ub=1200.0, 2025-05-07T20:33:07.6292840Z contiguous=True, 2025-05-07T20:33:07.6292925Z compiled=False, 2025-05-07T20:33:07.6292995Z ) 2025-05-07T20:33:07.6293212Z self = 2025-05-07T20:33:07.6293381Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6293386Z 2025-05-07T20:33:07.6293461Z @given( 2025-05-07T20:33:07.6293579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6293682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6293793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6293910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6294020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6294093Z ) 2025-05-07T20:33:07.6294384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6294478Z def test_silu_mul_quant( 2025-05-07T20:33:07.6294551Z self, 2025-05-07T20:33:07.6294630Z T: int, 2025-05-07T20:33:07.6294702Z D: int, 2025-05-07T20:33:07.6294795Z scale_ub: Optional[float], 2025-05-07T20:33:07.6294884Z contiguous: bool, 2025-05-07T20:33:07.6294966Z compiled: bool, 2025-05-07T20:33:07.6295044Z ) -> None: 2025-05-07T20:33:07.6295135Z torch.manual_seed(2025) 2025-05-07T20:33:07.6295204Z 2025-05-07T20:33:07.6295367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6297157Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6297165Z 2025-05-07T20:33:07.6297284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6297289Z 2025-05-07T20:33:07.6297387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6297609Z self=, 2025-05-07T20:33:07.6297688Z T=16384, 2025-05-07T20:33:07.6297760Z D=7168, 2025-05-07T20:33:07.6297883Z scale_ub=None, 2025-05-07T20:33:07.6297968Z contiguous=False, 2025-05-07T20:33:07.6298047Z compiled=True, 2025-05-07T20:33:07.6298122Z ) 2025-05-07T20:33:07.6298341Z self = 2025-05-07T20:33:07.6298516Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:07.6298523Z 2025-05-07T20:33:07.6298599Z @given( 2025-05-07T20:33:07.6298716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6298809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6298922Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6299034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6299144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6299219Z ) 2025-05-07T20:33:07.6299461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6299620Z def test_silu_mul_quant( 2025-05-07T20:33:07.6299698Z self, 2025-05-07T20:33:07.6299775Z T: int, 2025-05-07T20:33:07.6299854Z D: int, 2025-05-07T20:33:07.6299948Z scale_ub: Optional[float], 2025-05-07T20:33:07.6300033Z contiguous: bool, 2025-05-07T20:33:07.6300118Z compiled: bool, 2025-05-07T20:33:07.6300233Z ) -> None: 2025-05-07T20:33:07.6300323Z torch.manual_seed(2025) 2025-05-07T20:33:07.6300396Z 2025-05-07T20:33:07.6300560Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6302392Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6302400Z 2025-05-07T20:33:07.6302512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6302517Z 2025-05-07T20:33:07.6302617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6302882Z self=, 2025-05-07T20:33:07.6302958Z T=4096, 2025-05-07T20:33:07.6303034Z D=7168, 2025-05-07T20:33:07.6303114Z scale_ub=None, 2025-05-07T20:33:07.6303195Z contiguous=True, 2025-05-07T20:33:07.6303279Z compiled=False, 2025-05-07T20:33:07.6303349Z ) 2025-05-07T20:33:07.6303566Z self = 2025-05-07T20:33:07.6304052Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6304060Z 2025-05-07T20:33:07.6304154Z @given( 2025-05-07T20:33:07.6304277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6304370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6304479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6304594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6304701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6304773Z ) 2025-05-07T20:33:07.6305019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6305107Z def test_silu_mul_quant( 2025-05-07T20:33:07.6305179Z self, 2025-05-07T20:33:07.6305252Z T: int, 2025-05-07T20:33:07.6305320Z D: int, 2025-05-07T20:33:07.6305414Z scale_ub: Optional[float], 2025-05-07T20:33:07.6305503Z contiguous: bool, 2025-05-07T20:33:07.6305583Z compiled: bool, 2025-05-07T20:33:07.6305659Z ) -> None: 2025-05-07T20:33:07.6305745Z torch.manual_seed(2025) 2025-05-07T20:33:07.6305814Z 2025-05-07T20:33:07.6306076Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6307874Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6307883Z 2025-05-07T20:33:07.6307996Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6308001Z 2025-05-07T20:33:07.6308096Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6308317Z self=, 2025-05-07T20:33:07.6308461Z T=16384, 2025-05-07T20:33:07.6308535Z D=7168, 2025-05-07T20:33:07.6308611Z scale_ub=None, 2025-05-07T20:33:07.6308692Z contiguous=True, 2025-05-07T20:33:07.6308772Z compiled=False, 2025-05-07T20:33:07.6308844Z ) 2025-05-07T20:33:07.6309056Z self = 2025-05-07T20:33:07.6309298Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:07.6309302Z 2025-05-07T20:33:07.6309381Z @given( 2025-05-07T20:33:07.6309494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6309588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6309703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6309878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6309986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6310060Z ) 2025-05-07T20:33:07.6310309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6310406Z def test_silu_mul_quant( 2025-05-07T20:33:07.6310478Z self, 2025-05-07T20:33:07.6310550Z T: int, 2025-05-07T20:33:07.6310623Z D: int, 2025-05-07T20:33:07.6310717Z scale_ub: Optional[float], 2025-05-07T20:33:07.6310798Z contiguous: bool, 2025-05-07T20:33:07.6310955Z compiled: bool, 2025-05-07T20:33:07.6311030Z ) -> None: 2025-05-07T20:33:07.6311120Z torch.manual_seed(2025) 2025-05-07T20:33:07.6311191Z 2025-05-07T20:33:07.6311355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6313163Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:07.6313172Z 2025-05-07T20:33:07.6313284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:07.6313292Z 2025-05-07T20:33:07.6313395Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.6313614Z self=, 2025-05-07T20:33:07.6313686Z T=16384, 2025-05-07T20:33:07.6313758Z D=7168, 2025-05-07T20:33:07.6313833Z scale_ub=1200.0, 2025-05-07T20:33:07.6313913Z contiguous=True, 2025-05-07T20:33:07.6313999Z compiled=False, 2025-05-07T20:33:07.6314065Z ) 2025-05-07T20:33:07.6314279Z self = 2025-05-07T20:33:07.6314453Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:07.6314504Z 2025-05-07T20:33:07.6314580Z @given( 2025-05-07T20:33:07.6314697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.6314791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.6314902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.6315020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.6315131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.6315200Z ) 2025-05-07T20:33:07.6315445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.6315535Z def test_silu_mul_quant( 2025-05-07T20:33:07.6315604Z self, 2025-05-07T20:33:07.6315677Z T: int, 2025-05-07T20:33:07.6315754Z D: int, 2025-05-07T20:33:07.6315848Z scale_ub: Optional[float], 2025-05-07T20:33:07.6315937Z contiguous: bool, 2025-05-07T20:33:07.6316015Z compiled: bool, 2025-05-07T20:33:07.6316092Z ) -> None: 2025-05-07T20:33:07.6316226Z torch.manual_seed(2025) 2025-05-07T20:33:07.6316295Z 2025-05-07T20:33:07.6316461Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.6318260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
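For the larger examples above, the reported allocation sizes are exactly the size of the bf16 input tensor x of shape [T, 2 * D] at 2 bytes per element, so no single request is unreasonable; the failures come from the roughly 21.7 GiB the process is already holding. A quick arithmetic check (a standalone sketch, not part of the test suite):

# The "Tried to allocate" sizes above match the bf16 tensor
# x = torch.randn([T, 2 * D]): T * (2 * D) elements * 2 bytes each.
for T, D, reported in [(2048, 5120, 40), (4096, 5120, 80),
                       (4096, 7168, 112), (16384, 7168, 448)]:
    mib = T * (2 * D) * 2 / (1 << 20)  # bytes -> MiB
    print(f"T={T:5d} D={D}: computed {mib:6.2f} MiB, log reported {reported}.00 MiB")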
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8916e4cca0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
(test source as above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
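The CompilationError above is an architecture limitation rather than a test bug: Triton's fp8e4nv type is FP8 E4M3, which Triton only supports on GPUs with compute capability 8.9 or newer, while the g5.4xlarge runner's A10G reports capability 8.6 and therefore only offers fp8e4b15 and fp8e5. A minimal sketch of a capability gate, assuming a hypothetical helper and test-class name (this is not FBGEMM's actual guard):

import unittest
import torch

def supports_fp8_e4m3() -> bool:
    # Triton's fp8e4nv maps to FP8 E4M3, available on SM 8.9+ (Ada/Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip quantization tests on older parts such as the A10G (SM 8.6).
@unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU")
class SiluMulQuantTests(unittest.TestCase):
    ...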
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT and compiler frames identical to the traceback above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source as above; this example got past the allocation at line 92 and ran out of memory one step later)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
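By this point only 4.44 MiB of the 22.07 GiB device is free and even 20.00 MiB requests fail, with free memory shrinking as examples run; that pattern points at tensors surviving from one generated example to the next. A common mitigation, sketched under the assumption that cleanup at the top of the test body is acceptable (the helper name is hypothetical; Hypothesis re-invokes the decorated function once per example, so a call there runs every time):

import gc
import torch

def reclaim_cuda_memory() -> None:
    gc.collect()                  # drop dead Python references to tensors
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # let queued kernels finish first
        torch.cuda.empty_cache()  # hand cached allocator blocks back to the driver

# Usage sketch: call reclaim_cuda_memory() as the first statement of
# test_silu_mul_quant, before the torch.randn allocation at line 92.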
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
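The error message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation and has to be in place before the process makes its first CUDA allocation, so in CI it normally belongs in the workflow environment rather than in test code. Note that here only a few MiB are reserved-but-unallocated, so fragmentation is minor and this setting alone would likely not rescue the run. A sketch of setting it from Python anyway:

# Must run before the first CUDA allocation so the caching allocator sees it.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the variable, to be safe
x = torch.zeros(1, device="cuda")  # first allocation now uses expandable segments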
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.9/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:33:07.6371488Z 2025-05-07T20:33:07.6371700Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:07.6371883Z ================= 1 failed, 1 deselected, 3 warnings in 19.25s ================= 2025-05-07T20:33:09.1173757Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:09.1806237Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:09.1806503Z 2025-05-07T20:33:09.1806673Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:09.1807238Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:09.1807629Z 2025-05-07T20:33:09.1807633Z 2025-05-07T20:33:09.1807637Z 2025-05-07T20:33:09.1824996Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:09.1905289Z Post job cleanup. 2025-05-07T20:33:09.2887495Z [command]/usr/bin/git version 2025-05-07T20:33:09.2932511Z git version 2.47.1 2025-05-07T20:33:09.2971276Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/0437cfae-c772-4cbd-8dab-3158a79dbfad/.gitconfig' 2025-05-07T20:33:09.2982176Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/0437cfae-c772-4cbd-8dab-3158a79dbfad' before making global git config changes 2025-05-07T20:33:09.2983039Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:09.2987863Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:09.3033971Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:09.3068298Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:09.3403641Z Entering 'external/asmjit' 2025-05-07T20:33:09.3471469Z Entering 'external/composable_kernel' 2025-05-07T20:33:09.3545552Z Entering 'external/cpuinfo' 2025-05-07T20:33:09.3612149Z Entering 'external/cutlass' 2025-05-07T20:33:09.3686965Z Entering 'external/googletest' 2025-05-07T20:33:09.3752736Z Entering 'external/hipify_torch' 2025-05-07T20:33:09.3819357Z Entering 'external/json' 2025-05-07T20:33:09.3905330Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:09.3933286Z http.https://github.com/.extraheader 2025-05-07T20:33:09.3945447Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:09.3979975Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:09.4309195Z Entering 'external/asmjit' 2025-05-07T20:33:09.4353045Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4395621Z Entering 'external/composable_kernel' 2025-05-07T20:33:09.4438595Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4487744Z Entering 'external/cpuinfo' 2025-05-07T20:33:09.4531439Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4575796Z Entering 'external/cutlass' 2025-05-07T20:33:09.4618746Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4669899Z 
Entering 'external/googletest' 2025-05-07T20:33:09.4717460Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4760238Z Entering 'external/hipify_torch' 2025-05-07T20:33:09.4802236Z http.https://github.com/.extraheader 2025-05-07T20:33:09.4844573Z Entering 'external/json' 2025-05-07T20:33:09.4891277Z http.https://github.com/.extraheader 2025-05-07T20:33:09.5041005Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:09.5071536Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:09.5081922Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:09.5082308Z ##[endgroup] 2025-05-07T20:33:09.5201906Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:20.2975340Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:36.6410716Z Cleaning up orphan processes
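The harness's retry wrapper reran the suite with pytest's last-failed selection, as shown in the conda command logged above. To reproduce the same selection locally, a sketch using pytest's Python entry point (equivalent to the logged command line, minus conda; run from the test directory):

import pytest

pytest.main([
    "-v", "-rsx", "-s",
    "-W", "ignore::pytest.PytestCollectionWarning",
    "--lf", "--last-failed-no-failures", "none",  # rerun only previously failed tests
    "./moe/activation_test.py",
])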